Abstract

'Approximate message passing' algorithms have proved to be effective in reconstructing sparse 
signals from a small number of incoherent linear measurements. Extensive numerical experiments 
further showed that their dynamics is accurately tracked by a simple one-dimensional iteration 
termed state evolution. In this paper we provide a rigorous foundation for state evolution. We prove
that it indeed holds asymptotically in the large system limit for sensing matrices with independent
and identically distributed gaussian entries. 

While our focus is on message passing algorithms for compressed sensing, the analysis extends 
beyond this setting, to a general class of algorithms on dense graphs. In this context, state 
evolution plays the role that density evolution has for sparse graphs. 

The proof technique is fundamentally different from the standard approach to density evolution,
in that it copes with a large number of short cycles in the underlying factor graph. It relies
instead on a conditioning technique recently developed by Erwin Bolthausen in the context of
spin glass theory.

1 Introduction and main results

Given an $n \times N$ matrix $A$, the compressed sensing reconstruction problem requires reconstructing
a sparse vector $x_0 \in \mathbb{R}^N$ from a (small) vector of linear observations $y = A x_0 + w \in \mathbb{R}^n$. Here $w$
is a noise vector and $A$ is assumed to be known. Recently [DMM09] suggested the following first
order approximate message-passing (AMP) algorithm for reconstructing $x_0$ given $A$, $y$. Start with
an initial guess $x^0 = 0$ and proceed by

$$x^{t+1} = \eta_t(A^* z^t + x^t), \tag{1.1}$$
$$z^t = y - A x^t + \frac{1}{\delta}\, z^{t-1} \big\langle \eta'_{t-1}(A^* z^{t-1} + x^{t-1}) \big\rangle,$$

for an appropriate sequence of non-linear functions $\{\eta_t\}_{t \ge 0}$. (Here by convention any variable with
negative index is assumed to be 0.) The algorithm succeeds if $x^t$ converges to a good approximation
of $x_0$ (cf. [DMM09] for details).
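
As an illustration, the iteration (1.1) can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions that are not part of the text above: $\eta_t$ is taken to be a soft-threshold function, the threshold is set to a hypothetical multiple `alpha` of the empirical quantity $\|z^t\|/\sqrt{n}$, and the names `amp` and `soft_threshold` are illustrative.

```python
import numpy as np

def soft_threshold(u, theta):
    """Componentwise soft thresholding (assumed choice of eta_t)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def amp(A, y, alpha=1.5, n_iter=20):
    """Sketch of iteration (1.1) with eta_t = soft threshold.

    The threshold rule theta_t = alpha * ||z^t|| / sqrt(n) is an assumption
    made for this example; it is not prescribed by the text.
    """
    n, N = A.shape
    delta = n / N
    x = np.zeros(N)          # initial guess x^0 = 0
    z = y.copy()             # z^0 = y - A x^0; the last term vanishes at t = 0
    for _ in range(n_iter):
        theta = alpha * np.linalg.norm(z) / np.sqrt(n)
        u = A.T @ z + x                      # argument of eta_t
        x_new = soft_threshold(u, theta)     # x^{t+1}
        # <eta_t'(u)> is the fraction of entries above threshold
        correction = (z / delta) * np.mean(np.abs(u) > theta)
        z = y - A @ x_new + correction       # z^{t+1}
        x = x_new
    return x
```

A typical call would draw `A` with i.i.d. $N(0, 1/n)$ entries, form `y = A @ x0 + w` for a sparse `x0`, and compare `amp(A, y)` with `x0`.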

Throughout this paper, the matrix $A$ is normalized in such a way that its columns have $\ell_2$ norm
concentrated around 1. Given a vector $x \in \mathbb{R}^N$ and a scalar function $f : \mathbb{R} \to \mathbb{R}$, we write $f(x)$ for
the vector obtained by applying $f$ componentwise. Further, $\delta = n/N$, $\langle v \rangle \equiv N^{-1} \sum_{i=1}^{N} v_i$, and $A^*$ is
the transpose of the matrix $A$.

Three findings were presented in [DMM09]:

(1) For a large class of random matrices $A$, the behavior of the AMP algorithm is accurately described
by a formalism called 'state evolution' (SE). Extensive numerical experiments tested this claim
on gaussian, Rademacher, and partial Fourier matrices;

(2) The sparsity-undersampling tradeoff of AMP as derived from SE coincides, for an appropriate
choice of the functions $\eta_t$, with the one of convex optimization approaches. Let us stress that
standard convex optimization algorithms do not scale to large applications (e.g. to image
processing), while the computational complexity of AMP is as low as the one of the simplest
greedy algorithms;

(3) As a byproduct of (1) and (2), SE allows one to re-derive reconstruction phase boundaries earlier
determined via random polytope geometry (see in particular [DT05, DT09] and references
therein).

These findings were based on heuristic arguments and numerical simulations. In this paper we 
provide the first rigorous support to finding (1), by proving that SE holds in the large system limit, 
for random sensing matrices A with gaussian entries. Implications on points (2) and (3) will be 
reported in a forthcoming paper. 

Interestingly, state evolution gives access to sharp predictions that cannot be derived from random 
polytope geometry. A prominent example is the noise sensitivity of LASSO, which is investigated in 
[DMM10c].

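Note that AMP is an approximation to the following message passing algorithm. For all $i, j \in [N]$
and $a, b \in [n]$ (here and below $[N] \equiv \{1, 2, \dots, N\}$) start with messages $x^0_{j \to a} = 0$ and proceed by

$$z^t_{a \to i} = y_a - \sum_{j \in [N] \setminus i} A_{aj}\, x^t_{j \to a}, \tag{1.2}$$
$$x^{t+1}_{i \to a} = \eta_t\Big( \sum_{b \in [n] \setminus a} A_{bi}\, z^t_{b \to i} \Big).$$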

As argued in [DMM10a], AMP accurately approximates message passing in the large system limit.
We refer to Appendix A for a heuristic argument justifying the AMP update rules (1.1) starting
from the algorithm (1.2). While this derivation is not necessary for the proofs of this paper, it can
help the reader familiar with message passing algorithms to develop the correct intuition.

An important tool for the analysis of message passing algorithms is provided by density evolution
[RU08]. Density evolution is known to hold asymptotically for sequences of sparse graphs that are
locally tree-like. The factor graph underlying the algorithm (1.2) is dense: indeed it is the complete
bipartite graph. State evolution can be regarded (in a very precise sense) as the analogue of density
evolution for dense graphs.

For the sake of concreteness, we will focus in this section on the algorithm (1.1), and will keep
to the compressed sensing language. Nevertheless our analysis applies to a much larger family of
message passing algorithms on dense graphs, for instance the multi-user detection algorithm studied
in [Kab03, NS05, MT06]. Applications to such algorithms are discussed in Section 2. Section 3
describes an even more general formulation, as well as the proof of our theorems. Finally, Section 4
describes a generalization to the case of symmetric matrices $A$ that is directly related to the work of
Erwin Bolthausen [Bol09].

It is important to mention that the algorithms (1.1) and (1.2) are completely different from
gaussian belief propagation (BP). The gaussian assumption refers indeed to the distribution of the
matrix entries, not to the variables to be inferred. More generally, none of the existing rigorous
results for BP seems to be applicable here.

It is remarkable that density evolution (in its special incarnation, SE) holds for dense graphs.
This is at odds with the standard argument used for justifying density evolution so far: 'density
evolution works because the graph is locally tree-like.' To the best of our knowledge, the approach
developed here is the first one that overcomes the limitations of the standard argument (a discussion
of earlier literature is provided in Section 1.4).

1.1 Main result 

We begin with some missing definitions for algorithm (1.1). We assume

$$y = A x_0 + w, \tag{1.3}$$

with $w \in \mathbb{R}^n$ a vector with i.i.d. entries with mean 0 and variance $\sigma^2$. In Section 3.2 we will show
that the i.i.d. assumption can be relaxed to existence of a weak limit for the empirical distribution of
$w$ with certain moment conditions. Further, let $\{\eta_t\}_{t \ge 0}$ be a sequence of scalar functions $\eta_t : \mathbb{R} \to \mathbb{R}$
which we assume to be Lipschitz continuous (and hence almost everywhere differentiable). Define
the sequence of vectors $\{x^t\}_{t \ge 0}$, $x^t \in \mathbb{R}^N$, $\{z^t\}_{t \ge 0}$, $z^t \in \mathbb{R}^n$, through Eqs. (1.1).

Next, let us formally define state evolution. Given a probability distribution $p_{X_0}$, let $\tau_0^2 \equiv
\sigma^2 + \mathbb{E}\{X_0^2\}/\delta$, and define recursively for $t \ge 0$,

$$\tau_{t+1}^2 = \sigma^2 + \frac{1}{\delta}\, \mathbb{E}\big\{ [\eta_t(X_0 + \tau_t Z) - X_0]^2 \big\}, \tag{1.4}$$

with $X_0 \sim p_{X_0}$ and $Z \sim N(0, 1)$ independent from $X_0$. We will use the term state evolution to refer
both to the recursion (1.4) (or its more general version introduced in Section 3.2) and to the sequence
$\{\tau_t\}_{t \ge 0}$ that it defines.
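
The recursion (1.4) is a scalar iteration and can be evaluated numerically. The sketch below is one possible way to do so, under assumptions made only for this example: $p_{X_0}$ is a discrete distribution given by `atoms` and `probs`, the expectation over $Z$ is computed by Gauss-Hermite quadrature, and the denoiser is passed as a hypothetical callable `eta(u, tau)`.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def state_evolution(eta, atoms, probs, sigma2, delta, n_iter, quad_deg=60):
    """Iterate Eq. (1.4) for a discrete prior; returns [tau_0^2, ..., tau_{n_iter}^2]."""
    z, w = hermegauss(quad_deg)      # nodes/weights for the weight exp(-z^2 / 2)
    w = w / w.sum()                  # normalize to a standard normal expectation
    atoms = np.asarray(atoms, dtype=float)
    probs = np.asarray(probs, dtype=float)
    tau2 = sigma2 + np.sum(probs * atoms**2) / delta   # tau_0^2
    history = [tau2]
    for _ in range(n_iter):
        tau = np.sqrt(tau2)
        X = atoms[:, None]                               # shape (atoms, 1)
        err2 = (eta(X + tau * z[None, :], tau) - X) ** 2 # shape (atoms, nodes)
        mse = np.sum(probs[:, None] * w[None, :] * err2)
        tau2 = sigma2 + mse / delta
        history.append(tau2)
    return np.array(history)

# Example: soft thresholding at level 1.5 * tau (an assumed threshold rule)
soft = lambda u, tau: np.sign(u) * np.maximum(np.abs(u) - 1.5 * tau, 0.0)
print(state_evolution(soft, [-1.0, 0.0, 1.0], [0.05, 0.9, 0.05], 0.0, 0.5, 10))
```

Quadrature is used here instead of Monte Carlo so that the output is deterministic; for continuous priors one would replace the sum over atoms by a second integration.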

Let us denote the empirical distribution$^2$ of a vector $x_0 \in \mathbb{R}^N$ by $\hat{p}_{x_0}$. Further, for $k \ge 1$ we say
a function $\phi : \mathbb{R}^m \to \mathbb{R}$ is pseudo-Lipschitz of order $k$ and denote it by $\phi \in \mathrm{PL}(k)$ if there exists a
constant $L > 0$ such that, for all $x, y \in \mathbb{R}^m$:

$$|\phi(x) - \phi(y)| \le L \big(1 + \|x\|^{k-1} + \|y\|^{k-1}\big) \|x - y\|. \tag{1.5}$$

Notice that when $\phi \in \mathrm{PL}(k)$, the following two properties follow:

(i) There is a constant $L'$ such that for all $x \in \mathbb{R}^m$: $|\phi(x)| \le L'(1 + \|x\|^k)$.

(ii) $\phi$ is locally Lipschitz, that is for any $M > 0$ there exists a constant $L_{M,m} < \infty$ such that for all
$x, y \in [-M, M]^m$,

$$|\phi(x) - \phi(y)| \le L_{M,m} \|x - y\|.$$
Further, $L_{M,m} \le c\,[1 + (M\sqrt{m})^{k-1}]$ for some constant $c$.

2 The probability distribution that puts a point mass 1/N at each of the N entries of the vector.

In the following we shall use generically $L$ for Lipschitz constants entering bounds of this type. It
is understood (and will not be mentioned explicitly) that the constant must be properly adjusted at
various passages.

Theorem 1. Let $\{A(N)\}_{N \ge 0}$ be a sequence of sensing matrices $A \in \mathbb{R}^{n \times N}$ indexed by $N$, with i.i.d.
entries $A_{ij} \sim N(0, 1/n)$, and assume $n/N \to \delta \in (0, \infty)$. Consider further a sequence of signals
$\{x_0(N)\}_{N \ge 0}$, whose empirical distributions converge weakly to a probability measure $p_{X_0}$ on $\mathbb{R}$ with
bounded $(2k-2)$th moment, and assume $\mathbb{E}_{\hat{p}_{x_0(N)}}(X_0^{2k-2}) \to \mathbb{E}_{p_{X_0}}(X_0^{2k-2})$ as $N \to \infty$ for some $k \ge 2$.
Also, assume the noise $w$ has i.i.d. entries with a distribution $p_W$ that has bounded $(2k-2)$th moment.
Then, for any pseudo-Lipschitz function $\psi : \mathbb{R}^2 \to \mathbb{R}$ of order $k$ and all $t \ge 0$, almost surely

$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \psi\big(x_i^{t+1}, x_{0,i}\big) = \mathbb{E}\big\{ \psi\big(\eta_t(X_0 + \tau_t Z), X_0\big) \big\}, \tag{1.6}$$

with $X_0 \sim p_{X_0}$ and $Z \sim N(0, 1)$ independent.

Up to a trivial change of variables, this is a formalization of the findings of [DMM09] (cf. in 
particular Eqs. (7), (8) and Finding 2 in that paper). 

As an immediate consequence of the above theorem we have the following decoupling principle
implying that a typical (finite) subset of the coordinates of $x^t$ are asymptotically independent.

Corollary 1 (Decoupling principle). Under the assumption of Theorem 1, fix $\ell \ge 2$, let $\psi : \mathbb{R}^{2\ell} \to \mathbb{R}$
be any Lipschitz function, and denote by $\mathbb{E}$ expectation with respect to a uniformly random subset of
distinct indices $J(1), \dots, J(\ell) \in [N]$.
Then for all $t > 0$, almost surely

$$\lim_{N \to \infty} \mathbb{E}\, \psi\big(x^t_{J(1)}, \dots, x^t_{J(\ell)}, x_{0,J(1)}, \dots, x_{0,J(\ell)}\big) = \mathbb{E}\big\{ \psi\big(\hat{X}_1, \dots, \hat{X}_\ell, X_{0,1}, \dots, X_{0,\ell}\big) \big\}, \tag{1.7}$$

where $\hat{X}_i \equiv \eta_{t-1}(X_{0,i} + \tau_{t-1} Z_i)$ for $X_{0,i} \sim p_{X_0}$ and $Z_i \sim N(0, 1)$, $i = 1, \dots, \ell$ mutually independent.
For the proof of this corollary we refer to Section 3.10.



1.2 Universality 

Our proof technique heavily relies on the assumption that $A(N)$ is gaussian. Nevertheless, we expect
the convergence expressed in Theorem 1 to be a fairly general result. In particular, we expect it
to hold for matrices with i.i.d. entries with zero mean and variance $1/n$, under a suitable moment
condition. This type of universality is quite common in random matrix theory, and several arguments
suggest that it should hold in the present case. For instance, it is possible to prove that state evolution
holds for this broader class of random matrices when $\eta_t(\cdot)$ is affine. Also, the heuristic argument
discussed in the next section is clearly insensitive to the details of the distribution of the entries.

Numerical evidence presented in [DMM09] (we refer in particular to the online supplement)
suggests that state evolution might hold for an even broader class of matrices. Determining the
domain of such a universality class is an outstanding open problem.



1.3 State evolution: the basic intuition 

The state evolution recursion has a simple heuristic description, which is useful to present here since
it clarifies the difficulties involved in the proof. In particular, this description brings up the key role
played by the last term in the update equation for $z^t$, which we will call the 'Onsager term', following
[DMM09].

Consider again the recursion (1.1), but introduce the following three modifications: (i) Replace
the random matrix $A$ with a new independent copy $A(t)$ at each iteration $t$; (ii) Correspondingly
replace the observation vector $y$ with $y^t = A(t) x_0 + w$; (iii) Eliminate the last term in the update
equation for $z^t$. We thus get the following dynamics:



$$x^{t+1} = \eta_t\big(A(t)^* z^t + x^t\big), \tag{1.8}$$
$$z^t = y^t - A(t)\, x^t, \tag{1.9}$$



where $A(0), A(1), A(2), \dots$ are i.i.d. matrices of dimensions $n \times N$ with i.i.d. entries $A_{ij}(t) \sim
N(0, 1/n)$. (Notice that, unlike in the rest of the paper, we use here the argument of $A$ to denote the
iteration number, and not the matrix dimensions.)

This recursion is most conveniently written by eliminating $z^t$:

$$x^{t+1} = \eta_t\big(A(t)^* y^t + (I - A(t)^* A(t))\, x^t\big)$$
$$\phantom{x^{t+1}} = \eta_t\big(x_0 + A(t)^* w + B(t)(x^t - x_0)\big), \tag{1.10}$$

where we defined $B(t) = I - A(t)^* A(t) \in \mathbb{R}^{N \times N}$. Notice that this recursion does not correspond to
any concrete algorithm, since the matrix $A$ changes from iteration to iteration. It is nevertheless
useful for developing intuition.

Using the central limit theorem, it is easy to show that each entry of $B(t)$ is approximately normal,
with zero mean and variance $1/n$. Further, distinct entries are approximately pairwise independent.
Therefore, if we let $\hat{\tau}_t^2 \equiv \lim_{N \to \infty} \|x^t - x_0\|^2 / N$, we obtain that $B(t)(x^t - x_0)$ converges to a vector
with i.i.d. normal entries with mean 0 and variance $N \hat{\tau}_t^2 / n = \hat{\tau}_t^2 / \delta$. Notice that this is true because
$A(t)$ is independent of $\{A(s)\}_{1 \le s \le t-1}$ and, in particular, of $(x^t - x_0)$.
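
The variance claim for $B(t)(x^t - x_0)$ can be checked numerically on a single draw. The following is a minimal sketch, assuming a generic vector `v` drawn independently of the matrix to stand in for $x^t - x_0$; the sizes and the helper name `apply_B` are illustrative.

```python
import numpy as np

def apply_B(A, v):
    """Return B v with B = I - A^T A, without forming the N x N matrix B."""
    return v - A.T @ (A @ v)

rng = np.random.default_rng(0)
n, N = 500, 1000
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, N))   # entries N(0, 1/n)
v = rng.normal(size=N)       # stands in for x^t - x_0, independent of A
u = apply_B(A, v)

empirical_var = np.mean(u**2)          # empirical second moment of the entries
predicted_var = np.dot(v, v) / n       # heuristic prediction ||v||^2 / n
print(empirical_var, predicted_var)
```

The two printed numbers should be close for large $n$ and $N$; the agreement breaks down if `v` is allowed to depend on `A`, which is exactly the difficulty discussed below.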

Conditional on $w$, $A(t)^* w$ is a vector of i.i.d. normal entries with mean 0 and variance $(1/n)\|w\|^2$
which converges by the law of large numbers to $\sigma^2$. A slightly longer exercise shows that these entries
are approximately independent from the ones of $B(t)(x^t - x_0)$. Summarizing, each entry of the vector
in the argument of $\eta_t$ in Eq. (1.10) converges to $X_0 + \tau_t Z$ with $Z \sim N(0, 1)$ independent of $X_0$, and

$$\tau_t^2 = \sigma^2 + \frac{1}{\delta}\, \hat{\tau}_t^2, \qquad \hat{\tau}_t^2 = \lim_{N \to \infty} \frac{1}{N} \|x^t - x_0\|^2. \tag{1.11}$$

On the other hand, by Eq. (1.10), each entry of $x^{t+1} - x_0$ converges to $\eta_t(X_0 + \tau_t Z) - X_0$, and
therefore

$$\hat{\tau}_{t+1}^2 = \lim_{N \to \infty} \frac{1}{N} \|x^{t+1} - x_0\|^2 = \mathbb{E}\big\{ [\eta_t(X_0 + \tau_t Z) - X_0]^2 \big\}. \tag{1.12}$$

Using together Eqs. (1.11) and (1.12) we finally obtain the state evolution recursion, Eq. (1.4).

We conclude that state evolution would hold if the matrix $A$ was drawn independently from
the same gaussian distribution at each iteration. In the case of interest, $A$ does not change across
iterations, and the above argument falls apart because $x^t$ and $A$ are dependent. This dependency is
non-negligible even in the large system limit $N \to \infty$. This point can be clarified by considering the
iteration

$$x^{t+1} = \eta_t(A^* z^t + x^t), \tag{1.13}$$
$$z^t = y - A x^t,$$

with a matrix $A$ constant across iterations. This iteration is the basis of several algorithms in compressed
sensing, most notably the so-called 'iterative soft thresholding' [DDM04]. Such algorithms
have been the object of great interest because of the high computational cost of standard convex
optimization methods in large scale applications.

Numerical studies of iterative soft thresholding [DM09, DMM09] show that its behavior is dramatically
different from the one in Eq. (1.1) and in particular state evolution does not hold for the
iterative soft thresholding iteration (1.13), even in the large system limit.

This is not a surprise: the correlations between $A$ and $x^t$ simply cannot be neglected. On the
other hand, adding the Onsager term leads to an asymptotic cancellation of these correlations. As
a consequence, state evolution holds for the AMP iteration (1.1) despite the fact that the matrix is
kept constant.

1.4 Related literature 

As mentioned, the standard argument for justifying density evolution relies on the locally tree-like
structure of the underlying graph. This argument was developed and systematically exploited for the
analysis of low-density parity-check (LDPC) codes under iterative decoding [RU08]. In this context,
density evolution provides an exact tool for computing asymptotic thresholds of code ensembles based 
on sparse graph constructions. Optimization of these thresholds has been a major design principle 
in LDPC codes. 

The locally tree-like property is a special case of local weak convergence. Local weak convergence
of graph sequences was first defined and studied in probability theory by Benjamini and Schramm
[BS96], and then greatly developed by David Aldous [AS03], in particular to study the so-called
'random assignment problem' [Ald01]. Loosely speaking, local weak convergence allows one to treat
sequences of graphs of increasing size, such that the neighborhood of a node converges to a well
defined limit object.

The random assignment problem is defined as a distribution of random instances of the assignment
problem on complete bipartite graphs. In particular, such graphs are not locally tree-like.
Nevertheless, they admit a rather simple local weak limit (called the PWIT), which is a tree. The
basic reason is that only a sparse subgraph of the complete bipartite graph is relevant for the minimum
cost assignment, namely the one of edges with small cost. One concrete way to derive density
evolution in this case is indeed to eliminate all the edges of cost larger than -say- A n /n with A n 
diverging slowly with the graph size n. The resulting graph is sparse and one can apply standard 
arguments (cf. [MM09] for an outline of this argument). A more sophisticated argument was presented
in [SS09] which nevertheless uses the existence of a non-trivial local weak limit, and the fact
that only a sparse subgraph is relevant (Lemma 4.1 in [SS09]).

This reduction to a sparse graph, and hence to a limit tree, is impossible in the class of algorithms
studied in our paper: the algorithm iteration cannot be approximated by an iteration on a sparse
graph (at least not on an instance-by-instance basis). This corresponds to the fact that no (simple)
local weak limit exists in our case. The underlying graph is the complete bipartite graph with vertex
sets $[N] = \{1, 2, \dots, N\}$ and $[n] = \{1, 2, \dots, n\}$, and edge-weights $A_{ai}$ for all $(a, i) \in [n] \times [N]$. If we
choose a node $i \in [N]$ as root, its depth-1 neighborhood consists of $n$ nodes, each carrying a weight
of order $1/\sqrt{n}$. Even this small neighborhood has no simple local weak limit.

This difference is analogous to the difference between mean-field spin glasses (e.g. the
Sherrington-Kirkpatrick model) and the random assignment problem [Tal10, Chapter 7]. As a consequence, our
proof does not rely on local weak convergence, and has to deal directly with the intricacies of graphs
with many short cycles.

The theorem proved in this paper is not only relevant for [DMM09] but for a larger context
as well. First of all, following the work by Tanaka [Tan02], hundreds of papers have been published
in information theory using the replica method to study multi-user detection problems. In its
replica-symmetric version, the replica method typically predicts the system performance through
the solution of a system of non-linear equations, which coincide with the fixed point equations for
state evolution. The present result provides a rigorous foundation to that line of work, along with
the analysis of a concrete algorithm that achieves that performance. Further, [GV05] insisted on the
role of a 'decoupling principle' that emerges from the replica method, and on the insight it provides.
Corollary 1 indeed proves a specific form of this decoupling principle.

A more recent line of work uses the replica method to study the typical performance of compressed
sensing methods. Although non-rigorous and limited to asymptotic statements, the replica method
has the advantage of providing sharp predictions. Standard techniques instead predict performance
up to undetermined multiplicative constants. The determination of these constants can be of guidance
for practical applications. This motivated several groups to publish results based on the replica
method [RFG09, KWT09, GBS09]. The present paper provides a rigorous foundation to this work
as well.

2 Examples 

In this section we discuss in greater detail some of the applications of Theorem 1 to specific problems.
To be definite, it is convenient to keep in mind a specific observable for applying Theorem 1. If we
choose the test function $\psi(x, y) = (x - y)^2$, we get almost surely

$$\lim_{N \to \infty} \frac{1}{N} \|x^t - x_0\|^2 = (\tau_t^2 - \sigma^2)\, \delta. \tag{2.1}$$

Therefore state evolution allows one to predict the mean square error of the iterative algorithm (1.1).
More generally, state evolution can be used to estimate $\ell_p$ distances for $p \le k$ through

$$\lim_{N \to \infty} \frac{1}{N} \|x^t - x_0\|_p^p = \mathbb{E}\big\{ |\eta_{t-1}(X_0 + \tau_{t-1} Z) - X_0|^p \big\}, \tag{2.2}$$

almost surely.

2.1 Linear estimation 

As a warm-up example consider the case in which the a priori distribution of $x_0$ is gaussian, namely
its entries are i.i.d. $N(0, v^2)$. It is a consequence of state evolution that the optimal AMP algorithm
makes use of linear scalar estimators

$$\eta_t(x) = \lambda_t x. \tag{2.3}$$

Clearly, such functions are Lipschitz continuous, for any finite $\lambda_t$. The AMP algorithm (1.1) becomes

$$x^{t+1} = \lambda_t (A^* z^t + x^t), \tag{2.4}$$
$$z^t = y - A x^t + \frac{\lambda_{t-1}}{\delta}\, z^{t-1}.$$

State evolution reads

$$\tau_{t+1}^2 = \sigma^2 + \frac{1}{\delta} (1 - \lambda_t)^2 v^2 + \frac{1}{\delta} \lambda_t^2 \tau_t^2. \tag{2.5}$$

Theorem 1 also shows that the empirical distribution of $\{(A^* z^t + x^t)_i - x_{0,i}\}_{i \in [N]}$ is asymptotically
gaussian with mean 0 and variance $\tau_t^2$. Hence, the optimal choice of $\lambda_t$ is

$$\lambda_t = \frac{v^2}{v^2 + \tau_t^2}. \tag{2.6}$$


Notice that this also minimizes the right hand side of Eq. (2.5). Under this choice, the recursion
(2.5) yields

$$\tau_{t+1}^2 = \sigma^2 + \frac{1}{\delta}\, \frac{v^2 \tau_t^2}{v^2 + \tau_t^2}. \tag{2.7}$$

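The right hand side is a concave function of $\tau_t^2$, and it is easy to show that $\tau_t \to \tau_\infty$ exponentially fast,
where, for $c \equiv (1 - \delta)/\delta$,

$$\tau_\infty^2 = \frac{1}{2} \Big\{ (\sigma^2 + c v^2) + \sqrt{(\sigma^2 + c v^2)^2 + 4 \sigma^2 v^2} \Big\}. \tag{2.8}$$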

The mean square error of the resulting algorithm is estimated via Eq. (2.1). In particular, under the
optimal choice of $\lambda_t$, the latter converges to $(\tau_\infty^2 - \sigma^2)\delta$ with $\tau_\infty$ given as above, thus yielding

$$\lim_{t \to \infty} \lim_{N \to \infty} \frac{1}{N} \|x^t - x_0\|^2 = \frac{\delta}{2} \Big\{ (-\sigma^2 + c v^2) + \sqrt{(\sigma^2 + c v^2)^2 + 4 \sigma^2 v^2} \Big\}. \tag{2.9}$$
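
The closed form (2.8) can be cross-checked against a direct iteration of the recursion (2.7). The sketch below does this for illustrative parameter values; the function names and the number of iterations are choices made for this example only.

```python
import math

def tau2_closed_form(sigma2, v2, delta):
    """Fixed point of (2.7) as given by Eq. (2.8)."""
    c = (1.0 - delta) / delta
    s = sigma2 + c * v2
    return 0.5 * (s + math.sqrt(s * s + 4.0 * sigma2 * v2))

def tau2_iterated(sigma2, v2, delta, n_iter=500):
    """Iterate Eq. (2.7) starting from tau_0^2 = sigma^2 + v^2 / delta."""
    tau2 = sigma2 + v2 / delta
    for _ in range(n_iter):
        tau2 = sigma2 + v2 * tau2 / (delta * (v2 + tau2))
    return tau2

def mse_limit(sigma2, v2, delta):
    """Asymptotic mean square error (tau_inf^2 - sigma^2) * delta, cf. Eq. (2.9)."""
    return (tau2_closed_form(sigma2, v2, delta) - sigma2) * delta

print(tau2_closed_form(0.2, 1.0, 0.5), tau2_iterated(0.2, 1.0, 0.5))
```

Substituting the closed form back into the right hand side of (2.7) gives an independent check that it is indeed a fixed point.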

We recall that the asymptotic mean square error of optimal (MMSE) linear estimation was
computed by Tse-Hanly and Verdu-Shamai in the case of random matrices $A$ with i.i.d. entries
[TH99, VS99]. The motivation came from the analysis of multiuser receivers. The resulting MSE
coincides with the value predicted in Eq. (2.9), thus showing that, in the linear case, the AMP
algorithm is asymptotically equivalent to the MMSE estimator.

Notice that the computation of the MMSE in [TH99, VS99] relied heavily on the Marcenko-Pastur
law for the limit spectral law of Wishart matrices [MP67]. Conversely, any calculation of the
MMSE as a function of the noise variance $\sigma^2$ gives access to the asymptotic Stieltjes transform of the
spectral measure of $A$. This suggests that state evolution is a non-trivial result already in the case
of linear $\eta_t(\cdot)$, since it can be used to derive the Marcenko-Pastur law in random matrix theory.

2.2 Compressed sensing via soft thresholding 

In this case the vector $x_0$ is $\ell$-sparse (i.e. it has at most $\ell$ non-vanishing entries). Assuming that the
empirical distribution of $x_0$ converges to the probability measure $p_{X_0}$, it is also natural to assume
$\ell/N \to \varepsilon$ as $N \to \infty$ with

$$\mathbb{P}\{X_0 \ne 0\} = \varepsilon. \tag{2.10}$$

(Indeed Theorem 1 accommodates a more general behavior, since $\hat{p}_{x_0(N)}$ is only required to
converge weakly.)

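In [DMM09], the authors proposed an algorithm of the form (1.1) with $\eta_t(x) = \eta(x; \theta_t)$ a sequence
of soft-threshold functions

$$\eta(x; \theta) = \begin{cases} x - \theta & \text{if } x > \theta, \\ 0 & \text{if } -\theta \le x \le \theta, \\ x + \theta & \text{if } x < -\theta. \end{cases} \tag{2.11}$$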



The function $x \mapsto \eta(x; \theta)$ is non-linear but nevertheless it is Lipschitz continuous. Therefore Theorem
1 applies to this case, and allows one to predict the asymptotic mean square error using Eqs. (1.4) and
(2.1).

This choice of the nonlinearity $\eta_t$ is close to the optimal in minimax sense. Indeed, a substantial
literature (see e.g. [DJ94b, DJ94a]) studies the problem of estimating the scalar $X_0$ from the noisy
observation

$$Y = X_0 + Z, \tag{2.12}$$

with $Z \sim N(0, s^2)$. For an appropriate choice of the threshold $\theta = \theta(\varepsilon, s)$, and $\varepsilon \downarrow 0$ (very sparse
sources), the soft thresholding estimator was proved to be minimax optimal, i.e. to achieve the
minimum worst-case MSE over the class (2.10). State evolution allows one to deduce that the choice
(2.11) yields the best algorithm of the form (1.1) for estimating sparse vectors, over the worst-case
vector $x_0$ [DMM09].

It is argued in [DMM09, DMM10c], and proved in [BM10] in the case of gaussian matrices, that
the asymptotic MSE of AMP coincides with the one of a popular convex optimization estimation
technique, known as the LASSO. The above argument is suggestive of a possible way to prove
minimax optimality of the LASSO.

Finally, state evolution provides a systematic way of improving the choice of the non-linearities
$\eta_t$ when the class of signals changes. The basic idea is to choose the function $\eta_t$ that minimizes the
right-hand side of Eq. (1.4) in minimax sense. This corresponds to constructing minimax MMSE
estimators for the scalar problem (2.12). For instance, in the limit case in which the distribution of $X_0$
is known, the MMSE estimator is simply conditional expectation, which leads to the choice

$$\eta_t(x) = \mathbb{E}\{X_0 \,|\, X_0 + \tau_t Z = x\}, \tag{2.13}$$

with $Z \sim N(0, 1)$. In other words, the very choice of the non-linearities is dictated by the gaussian
convergence phenomenon described in Theorem 1.

2.3 Multi-User Detection 

The model (1.3) is used to describe the input-output relation in a code division multiple access (CDMA)
channel. The matrix $A$ contains the users' signatures. A frequently used setting for theoretical
analysis consists in taking the large system limit with $n/N \to \delta$ giving the spreading factor, and
in assuming that the signatures (and hence $A$) have i.i.d. components. The entries $x_{0,i}$ belong
to the signal constellation used by the system. For the sake of simplicity, we consider the case
of antipodal signaling, i.e. $x_{0,i} \in \{+1, -1\}$ uniformly at random. Other signal constellations can
also be treated applying our Theorem 1. The hypothesis that $x_0$ is independent from $A$ is also
standard in this context and justified by the remark that the transmitted information is independent
from the signatures. Further, the source-channel separation theorem naturally leads to the uniform
distribution.

Following [Kab03, NS05, MT06] we take

$$\eta_t(x) = \tanh\{x / \tau_t^2\}. \tag{2.14}$$

The rationale for this choice is that it gives the conditional expectation of a uniformly random signal
$X_0 \in \{+1, -1\}$, given the observation $X_0 + \tau_t Z = x$ for $Z \sim N(0, 1)$ gaussian noise. This is therefore
a special case of the rule (2.13) and by the argument given there, it achieves minimal mean-square
error within the class of algorithms (1.1).
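
The identification of (2.14) with a conditional expectation can be verified directly from Bayes' rule. The sketch below computes the posterior mean of a uniform antipodal $X_0$ given $X_0 + \tau Z = x$ and compares it with $\tanh(x/\tau^2)$; the function name and the test values of $\tau$ and $x$ are illustrative.

```python
import numpy as np

def posterior_mean_antipodal(x, tau):
    """E{X0 | X0 + tau * Z = x} for X0 uniform on {+1, -1}, computed via Bayes' rule."""
    log_plus = -(x - 1.0) ** 2 / (2.0 * tau**2)    # log-likelihood of X0 = +1
    log_minus = -(x + 1.0) ** 2 / (2.0 * tau**2)   # log-likelihood of X0 = -1
    shift = np.maximum(log_plus, log_minus)        # subtract the max for numerical stability
    p_plus = np.exp(log_plus - shift)
    p_minus = np.exp(log_minus - shift)
    return (p_plus - p_minus) / (p_plus + p_minus)

x = np.linspace(-5.0, 5.0, 201)
tau = 0.7
print(np.max(np.abs(posterior_mean_antipodal(x, tau) - np.tanh(x / tau**2))))
```

The agreement follows because the log-likelihood ratio of the two hypotheses equals $2x/\tau^2$.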
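
The algorithm (1.1) reads in this case

$$x^{t+1} = \tanh\Big\{ \frac{1}{\tau_t^2} (A^* z^t + x^t) \Big\}, \tag{2.15}$$
$$z^t = y - A x^t + \frac{1}{\delta \tau_{t-1}^2}\, z^{t-1} \Big\{ 1 - \big\langle \tanh^2\big[(A^* z^{t-1} + x^{t-1}) / \tau_{t-1}^2\big] \big\rangle \Big\}.$$

State evolution yields

$$\tau_{t+1}^2 = \sigma^2 + \frac{1}{\delta}\, \mathbb{E}\big\{ [\tanh(\tau_t^{-2} + \tau_t^{-1} Z) - 1]^2 \big\}. \tag{2.16}$$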

This state evolution recursion was proved in [MT06] for properly chosen sparse signature matrices
$A$. Theorem 1 provides the first generalization to the more relevant case of dense signatures.

As mentioned in Section 1.4, Tanaka used the replica method to compute the asymptotic performance
of a MMSE receiver. The expressions obtained through this method correspond to a fixed
point of the recursion (2.16). It was further proved in [MT06] that, whenever the fixed point is
unique, this prediction is asymptotically correct. For such values of the parameters, we deduce that
the AMP algorithm is asymptotically equivalent to the MMSE receiver.

Let us point out that, in a practical setting, it might be inconvenient to estimate the noise variance
and/or to change the function $\eta_t$ across iterations. Several authors (see for instance [Tan05]) used
the function

$$\eta_t(x) = \tanh\{\beta x\}. \tag{2.17}$$
State evolution can be applied in this case as well (for any finite $\beta$) and reads

$$\tau_{t+1}^2 = \sigma^2 + \frac{1}{\delta}\, \mathbb{E}\big\{ [\tanh(\beta + \beta \tau_t Z) - 1]^2 \big\}. \tag{2.18}$$

On the other hand the case $\beta \to \infty$ is not covered by our Theorem 1 since it corresponds to the
discontinuous function $\eta_t(x) = \mathrm{sign}(x)$.



3 Proof 

The proof is based on a conditioning technique developed by Erwin Bolthausen for the analysis of
the so-called TAP equations in spin glass theory [Bol09]. Related ideas can also be found in [Don06].

In the next section, we provide a high-level description of the conditioning technique, by using
a simpler type of recursion as reference. We will then introduce some new notations and state and
prove a more general result than Theorem 1.

3.1 The conditioning technique: an informal description 

For understanding the conditioning technique, it is convenient to consider a somewhat simpler setting,
namely the one of symmetric matrices. This will be discussed more formally in Section 4. Let
$G = A^* + A$ where $A \in \mathbb{R}^{N \times N}$ has i.i.d. entries $A_{ij} \sim N(0, (2N)^{-1})$. We consider the iteration

$$h^{t+1} = G m^t - \lambda_t m^{t-1}, \tag{3.1}$$
$$m^t = f(h^t), \tag{3.2}$$

for $f : \mathbb{R} \to \mathbb{R}$ a non-linear function and $m^{-1} = 0$. For the sake of simplicity, $h^0 = 0$. The correct
expression for the scalar $\lambda_t$ is provided in Section 4 and state evolution holds only if this value is
used. On the other hand, this expression is not important for our informal discussion here.

Consider the first iteration. By definition $m^{-1} = 0$, whence $h^1 = G f(0)$ is a vector with i.i.d.
gaussian components with variance $\|f(0)\|^2 / N$. This follows in particular by the rotational invariance
of the distribution of $G$, which implies that, for a deterministic vector $v$, $Gv$ is distributed as $\|v\|\, G e_1$
for $e_1$ the first vector of the standard basis of $\mathbb{R}^N$ (see also Lemma 3 below).
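
The first-iteration statement can be checked numerically on one draw of $G$. This is a minimal sketch with an assumed value `f0` standing in for $f(0)$ and an illustrative dimension; it compares the empirical second moment of the entries of $h^1$ with $\|f(0)\|^2/N$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 800
A = rng.normal(0.0, np.sqrt(1.0 / (2 * N)), size=(N, N))   # entries N(0, 1/(2N))
G = A + A.T                                                # symmetric gaussian matrix

f0 = 0.7                     # hypothetical value of f(0)
m0 = np.full(N, f0)          # m^0 = f(h^0) with h^0 = 0, applied componentwise
h1 = G @ m0                  # first iterate; the lambda term vanishes since m^{-1} = 0

empirical_var = np.mean(h1**2)
predicted_var = np.dot(m0, m0) / N     # ||f(0)||^2 / N
print(empirical_var, predicted_var)
```

The two values agree up to fluctuations that shrink as $N$ grows; no such simple check is available at later iterations, where $m^t$ depends on $G$.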

Now consider the $t$-th iteration (i.e., $h^{t+1} = G m^t - \lambda_t m^{t-1}$). The problem in repeating the above
argument is that $G$ and $f(m^t)$ are dependent. For instance $f(m^t)$ might a priori align with the
minimum eigenvector of $G$. More generally the problem is that $G$ is not independent from the
$\sigma$-algebra $\mathfrak{S}_t$ generated by $\{h^0, h^1, \dots, h^t\}$.

The key idea in the conditioning technique is to avoid computing the conditional distribution of
$m^t$ given $G$. We instead compute the conditional distribution of $G$ given $\mathfrak{S}_t$.

The next important remark is that $\mathfrak{S}_t$ contains $\{m^0, m^1, \dots, m^t\}$ as well. Conditioning on $\mathfrak{S}_t$ is
therefore equivalent to conditioning on the event

$$\mathcal{E}_t = \big\{ h^1 + \lambda_0 m^{-1} = G m^0, \; \dots, \; h^t + \lambda_{t-1} m^{t-2} = G m^{t-1} \big\},$$

which is in turn equivalent to making a set of linear observations of $G$.

At this point, the assumption that $G$ is gaussian plays a crucial role. The conditional distribution
of a gaussian random variable $G$ given linear observations is the same as its conditional expectation
plus the projection of an independent gaussian. In formulae:

$$G|_{\mathfrak{S}_t} \stackrel{d}{=} G|_{\mathcal{E}_t} \stackrel{d}{=} \mathbb{E}\{G \,|\, \mathfrak{S}_t\} + P_t^{\perp} G^{\rm new} P_t^{\perp},$$
with $P_t^{\perp}$ an appropriate projector. If we write $E_t = \mathbb{E}\{G \,|\, \mathfrak{S}_t\}$, we have

$$G m^t|_{\mathfrak{S}_t} \stackrel{d}{=} G^{\rm new} (P_t^{\perp} m^t) - (I - P_t^{\perp})\, G^{\rm new} (P_t^{\perp} m^t) + E_t m^t.$$

We refer to the actual proof for a calculation of the various terms involved.

Each of the above terms can be written explicitly as a function of the observed values $\{m^0, m^1, \dots, m^t\}$
and of the new gaussian random variables $G^{\rm new}$. The first term $G^{\rm new}(P_t^{\perp} m^t)$ is clearly gaussian. The
other terms are not. In order to control them, we will proceed by induction over $t$ and use an appropriate
strong law of large numbers for triangular arrays. The key phenomenon is that the only
non-gaussian term that does not vanish in the large system limit cancels with the term $-\lambda_t m^{t-1}$ in
recursion (3.1), thus implying the claimed gaussianity of $h^{t+1}$.

3.2 A general result 

We now describe a more general recursion than in Eq. (1.1). In the next section we show that the
AMP algorithm (1.1) can be regarded as a special case of the recursion defined here.

The algorithm is defined by two sequences of functions $\{f_t\}_{t\ge 0}$, $\{g_t\}_{t\ge 0}$, where for each $t \ge 0$,
$f_t : \mathbb{R}^2 \to \mathbb{R}$ and $g_t : \mathbb{R}^2 \to \mathbb{R}$ are assumed to be Lipschitz continuous. Recall that Lipschitz functions
are continuous, and are differentiable almost everywhere with a bounded derivative. As
before, given two vectors $a$, $b$ of equal dimension, we write $f_t(a, b)$ for the vector obtained by applying $f_t$ componentwise to $a$,
$b$. When $b$ is clear from the context we will just write, with an abuse of notation, $f_t(a)$. We will use
analogous notations for $g_t$.
Given $w \in \mathbb{R}^n$ and $x_0 \in \mathbb{R}^N$, define the sequence of vectors $h^t, q^t \in \mathbb{R}^N$ and $b^t, m^t \in \mathbb{R}^n$, by fixing
initial condition $q^0$, and obtaining $\{b^t\}_{t\ge 0}$, $\{m^t\}_{t\ge 0}$, $\{h^t\}_{t\ge 1}$, and $\{q^t\}_{t\ge 1}$ through

$$h^{t+1} = A^* m^t - \xi_t q^t, \qquad m^t = g_t(b^t, w),$$

$$b^t = A q^t - \lambda_t m^{t-1}, \qquad q^t = f_t(h^t, x_0), \tag{3.3}$$

where $\xi_t = \langle g_t'(b^t, w)\rangle$, $\lambda_t = \frac{1}{\delta}\langle f_t'(h^t, x_0)\rangle$ (both derivatives are with respect to the first argument),
and we recall that, by definition, $m^{-1} = 0$.
Assume that the limit

$$\sigma_0^2 \equiv \lim_{N\to\infty} \frac{1}{N\delta}\|q^0\|^2 \tag{3.4}$$

exists, is positive and finite, for a sequence of initial conditions of increasing dimensions. State
evolution defines quantities $\{\tau_t^2\}_{t\ge 0}$ and $\{\sigma_t^2\}_{t\ge 0}$ via

$$\tau_t^2 = \mathbb{E}\{g_t(\sigma_t Z, W)^2\}, \qquad \sigma_t^2 = \frac{1}{\delta}\,\mathbb{E}\{f_t(\tau_{t-1} Z, X_0)^2\}, \tag{3.5}$$

where $W \sim p_W$ and $X_0 \sim p_{X_0}$ are independent of $Z \sim N(0, 1)$. Further, recall the notion of
pseudo-Lipschitz function of order $k > 1$ from Section 1.1. We have the following general result.

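Theorem 2. Let $\{q_0(N)\}_{N\ge 0}$ and $\{A(N)\}_{N\ge 0}$ be, respectively, a sequence of initial conditions and
a sequence of matrices $A \in \mathbb{R}^{n\times N}$ indexed by $N$ with i.i.d. entries $A_{ij} \sim N(0, 1/n)$. Assume
$n/N \to \delta \in (0,\infty)$. Consider sequences of vectors $\{x_0(N), w(N)\}_{N\ge 0}$, whose empirical distributions
converge weakly to probability measures $p_{X_0}$ and $p_W$ on $\mathbb{R}$ with bounded $(2k-2)$-th moment, and
assume:

(i) $\lim_{N\to\infty} \mathbb{E}_{\hat p_{x_0(N)}}(X_0^{2k-2}) = \mathbb{E}_{p_{X_0}}(X_0^{2k-2}) < \infty$.

(ii) $\lim_{N\to\infty} \mathbb{E}_{\hat p_{w(N)}}(W^{2k-2}) = \mathbb{E}_{p_W}(W^{2k-2}) < \infty$.

(iii) $\lim_{N\to\infty} \mathbb{E}_{\hat p_{q_0(N)}}(X^{2k-2}) < \infty$.

Then, for any pseudo-Lipschitz function $\psi : \mathbb{R}^2 \to \mathbb{R}$ of order $k$ and all $t \ge 0$, almost surely

$$\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \psi(h_i^{t+1}, x_{0,i}) = \mathbb{E}\{\psi(\tau_t Z, X_0)\}, \tag{3.6}$$

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \psi(b_i^{t}, w_i) = \mathbb{E}\{\psi(\sigma_t Z, W)\}, \tag{3.7}$$

where $X_0 \sim p_{X_0}$ and $W \sim p_W$ are independent of $Z \sim N(0, 1)$, and $\sigma_t$, $\tau_t$ are determined by recursion
(3.5).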
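The state evolution recursion (3.5) can be iterated numerically. The following is a minimal sketch, not part of the original analysis: the function name `state_evolution`, the Monte Carlo approximation of the expectations, and the calling convention `f(t, s, x0)`, `g(t, s, w)` are all assumptions made for illustration.

```python
import math
import random


def state_evolution(f, g, sample_x0, sample_w, sigma0_sq, delta, steps,
                    n_mc=20000, seed=0):
    """Monte Carlo iteration of recursion (3.5) (illustrative sketch).

    f(t, s, x0) and g(t, s, w) play the roles of f_t and g_t;
    sample_x0(rng) and sample_w(rng) draw from p_{X_0} and p_W.
    Returns ([tau_0^2, ..., tau_steps^2], [sigma_0^2, ..., sigma_steps^2]).
    """
    rng = random.Random(seed)
    sigma_sq = [sigma0_sq]
    tau_sq = []
    for t in range(steps + 1):
        if t > 0:
            # sigma_t^2 = (1/delta) E{ f_t(tau_{t-1} Z, X_0)^2 }
            tau_prev = math.sqrt(tau_sq[t - 1])
            acc = 0.0
            for _ in range(n_mc):
                acc += f(t, tau_prev * rng.gauss(0, 1), sample_x0(rng)) ** 2
            sigma_sq.append(acc / (n_mc * delta))
        # tau_t^2 = E{ g_t(sigma_t Z, W)^2 }
        sigma_t = math.sqrt(sigma_sq[t])
        acc = 0.0
        for _ in range(n_mc):
            acc += g(t, sigma_t * rng.gauss(0, 1), sample_w(rng)) ** 2
        tau_sq.append(acc / n_mc)
    return tau_sq, sigma_sq
```

For linear choices such as $f_t(s, x_0) = s/2$ and $g_t(s, w) = s - w$ the recursion has a closed form ($\tau_t^2 = \sigma_t^2 + \mathbb{E}W^2$, $\sigma_t^2 = \tau_{t-1}^2/(4\delta)$), which gives a simple way to check the sketch.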



3.3 Corollary of Theorem 2: AMP and Theorem 1

As already mentioned, the AMP algorithm (1.1) is a special case of recursion (3.3). The reduction
is obtained by defining

$$h^{t+1} = x_0 - (A^* z^t + x^t), \tag{3.8}$$

$$q^t = x^t - x_0, \tag{3.9}$$

$$b^t = w - z^t, \tag{3.10}$$

$$m^t = -z^t. \tag{3.11}$$

The functions $f_t$ and $g_t$ are given by

$$f_t(s, x_0) = \eta_{t-1}(x_0 - s) - x_0, \qquad g_t(s, w) = s - w, \tag{3.12}$$

and the initial condition is $q^0 = -x_0$.
Note 1.

(a) Although the recursions (1.1) and (3.3) are equivalent mathematically, only the former can be
used as an algorithm. Indeed the recursion (3.3) tracks the difference of the current estimates $x^t$
from $x_0$, and is initialized using $x_0$ itself. The recursion (3.3) is only relevant for mathematical
analysis.

(b) Due to symmetry, for each $t$, all coordinates of the vector $h^t$ have the same distribution (similarly
for $b^t$, $q^t$ and $m^t$).
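The reduction (3.8)-(3.12) can be checked numerically on a small instance. The sketch below is illustrative only: it runs AMP (1.1) and recursion (3.3) side by side and records the largest deviation from $q^t = x^t - x_0$ and $b^t = w - z^t$. The soft threshold used for $\eta_t$, the problem sizes, and all function names are assumptions made for this example.

```python
import math
import random


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def rmatvec(A, v):
    # A^* v for a real matrix stored as a list of rows
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(A[0]))]


def eta(u, theta=0.5):
    # soft threshold, used here only as one example of a Lipschitz eta_t
    return math.copysign(max(abs(u) - theta, 0.0), u)


def eta_prime(u, theta=0.5):
    return 1.0 if abs(u) > theta else 0.0


def run_both(n=6, N=10, steps=3, seed=1):
    """Return max deviation from (3.9) and (3.10) along the two iterations."""
    rng = random.Random(seed)
    delta = n / N
    A = [[rng.gauss(0, 1) / math.sqrt(n) for _ in range(N)] for _ in range(n)]
    x0 = [rng.choice([0.0, 0.0, 1.0, -1.0]) for _ in range(N)]
    w = [0.1 * rng.gauss(0, 1) for _ in range(n)]
    y = [a + b for a, b in zip(matvec(A, x0), w)]

    x = [0.0] * N           # AMP: x^0 = 0
    z = list(y)             # AMP: z^0 = y - A x^0
    q = [-v for v in x0]    # recursion (3.3): q^0 = -x0
    b = matvec(A, q)        # b^0 = A q^0, since m^{-1} = 0
    gap = 0.0
    for _ in range(steps):
        gap = max(gap,
                  max(abs(qi - (xi - x0i)) for qi, xi, x0i in zip(q, x, x0)),
                  max(abs(bi - (wi - zi)) for bi, wi, zi in zip(b, w, z)))
        # AMP step (1.1)
        u = [a + xi for a, xi in zip(rmatvec(A, z), x)]   # A^* z^t + x^t
        x = [eta(ui) for ui in u]
        onsager = sum(eta_prime(ui) for ui in u) / (N * delta)
        z = [yi - axi + onsager * zi for yi, axi, zi in zip(y, matvec(A, x), z)]
        # Recursion (3.3) with f_t, g_t from (3.12); xi_t = <g_t'> = 1
        m = [bi - wi for bi, wi in zip(b, w)]
        h = [a - qi for a, qi in zip(rmatvec(A, m), q)]
        q = [eta(x0i - hi) - x0i for hi, x0i in zip(h, x0)]
        lam = -sum(eta_prime(x0i - hi) for hi, x0i in zip(h, x0)) / (N * delta)
        b = [aqi - lam * mi for aqi, mi in zip(matvec(A, q), m)]
    return gap
```

Up to floating-point rounding the returned gap should be zero, reflecting that the two recursions are the same iteration written in different variables.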

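3.4 Proof of Theorem 1

First note that (3.5) reduces to

$$\tau_t^2 = \sigma^2 + \sigma_t^2 = \sigma^2 + \frac{1}{\delta}\,\mathbb{E}\{[\eta_{t-1}(X_0 + \tau_{t-1}Z) - X_0]^2\},$$

with $\tau_0^2 = \sigma^2 + \delta^{-1}\mathbb{E}\{X_0^2\}$. The latter follows from

$$\sigma_0^2 = \lim_{N\to\infty}\frac{1}{N\delta}\|q^0\|^2 = \delta^{-1}\mathbb{E}\{X_0^2\}$$

and $\tau_0^2 = \sigma^2 + \sigma_0^2$. Also, by definition, $x^{t+1} = \eta_t(A^* z^t + x^t) = \eta_t(x_0 - h^{t+1})$. Therefore, applying
Theorem 2 to the function $(h_i^t, x_{0,i}) \mapsto \psi(\eta_{t-1}(x_{0,i} - h_i^t), x_{0,i})$ we obtain almost surely

$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\psi(x_i^t, x_{0,i}) = \mathbb{E}\{\psi(\eta_{t-1}(X_0 - \tau_{t-1}Z), X_0)\},$$

with $Z \sim N(0, 1)$ independent of $X_0 \sim p_{X_0}$, which yields the claim as $Z$ has the same distribution
as $-Z$. Note that since $\eta_{t-1}$ is Lipschitz continuous, when $\psi$ belongs to PL($k$) then $(h_i^t, x_{0,i}) \mapsto
\psi(\eta_{t-1}(x_{0,i} - h_i^t), x_{0,i})$ also belongs to PL($k$).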

3.5 Definitions and notations 

When the update equation for $h^{t+1}$ in (3.3) is used, all values of $b^0, \dots, b^t$, $m^0, \dots, m^t$, $h^1, \dots, h^t$
and $q^0, \dots, q^t$ have been previously calculated. Hence, we can consider the distribution of $h^{t+1}$
conditioned on all these known variables and also conditioned on $x_0$ and $w$. In particular, define
$\mathfrak{S}_{t_1,t_2}$ to be the $\sigma$-algebra generated by $b^0, \dots, b^{t_1-1}$, $m^0, \dots, m^{t_1-1}$, $h^1, \dots, h^{t_2}$, $q^0, \dots, q^{t_2}$ and $x_0$
and $w$. The basic idea of the proof is to compute the conditional distributions $b^t|_{\mathfrak{S}_{t,t}}$ and $h^{t+1}|_{\mathfrak{S}_{t+1,t}}$.
This is done by characterizing the conditional distribution of the matrix $A$ given this filtration.

Regarding $h^t$ and $b^t$ as column vectors, the equations for $b^0, \dots, b^{t-1}$ and $h^1, \dots, h^t$ can be written
in matrix form as:

$$\underbrace{[\,h^1 + \xi_0 q^0 \,|\, h^2 + \xi_1 q^1 \,|\, \cdots \,|\, h^t + \xi_{t-1} q^{t-1}\,]}_{X_t} = A^* \underbrace{[\,m^0 \,|\, \dots \,|\, m^{t-1}\,]}_{M_t},$$

$$\underbrace{[\,b^0 \,|\, b^1 + \lambda_1 m^0 \,|\, \cdots \,|\, b^{t-1} + \lambda_{t-1} m^{t-2}\,]}_{Y_t} = A \underbrace{[\,q^0 \,|\, \dots \,|\, q^{t-1}\,]}_{Q_t},$$

or in short $X_t = A^* M_t$ and $Y_t = A Q_t$. Here and below we use vertical lines to indicate columns of
a matrix, i.e. $[a_1|a_2|\cdots|a_k]$ is the matrix with columns $a_1, \dots, a_k$.

We also introduce the notation $m^t_{\parallel}$ for the projection of $m^t$ onto the column space of $M_t$ and
define $m^t_{\perp} = m^t - m^t_{\parallel}$. Similarly, define $q^t_{\parallel}$ and $q^t_{\perp}$ to be the parallel and orthogonal projections of $q^t$
onto the column space of $Q_t$. In particular, let $\vec\alpha = \vec\alpha_t = (\alpha_0, \dots, \alpha_{t-1})$ and $\vec\beta = \vec\beta_t = (\beta_0, \dots, \beta_{t-1})$ be
the vectors (in $\mathbb{R}^t$) of coefficients for these projections, i.e.,

$$m^t_{\parallel} = \sum_{i=0}^{t-1}\alpha_i m^i, \qquad q^t_{\parallel} = \sum_{i=0}^{t-1}\beta_i q^i. \tag{3.13}$$

We will show in Section 3.9 (cf. Corollary 2) that for any fixed $t$ as $N$ goes to infinity the quantities
$\beta_i$'s and $\alpha_i$'s have a finite limit.

Recall that $D^*$ denotes the transpose of the matrix $D$ and for a vector $u \in \mathbb{R}^m$: $\langle u\rangle = \sum_{i=1}^{m} u_i/m$.
Also, for vectors $u, v \in \mathbb{R}^m$ we define the scalar product

$$\langle u, v\rangle = \frac{1}{m}\sum_{i=1}^{m} u_i v_i .$$

Given two random variables $X$, $Y$, and a $\sigma$-algebra $\mathfrak{S}$, the notation $X|_{\mathfrak{S}} \stackrel{d}{=} Y$ means that for any
integrable function $\phi$ and for any random variable $Z$ measurable on $\mathfrak{S}$, $\mathbb{E}\{\phi(X)Z\} = \mathbb{E}\{\phi(Y)Z\}$. In
words we will say that $X$ is distributed as (or is equal in distribution to) $Y$ conditional on $\mathfrak{S}$. In
case $\mathfrak{S}$ is the trivial $\sigma$-algebra we simply write $X \stackrel{d}{=} Y$ (i.e. $X$ and $Y$ are equal in distribution). For
random variables $X$, $Y$ the notation $X \stackrel{\rm a.s.}{=} Y$ means that $X$ and $Y$ are equal almost surely.

The large system limit will be denoted either as $\lim_{N\to\infty}$ or as $\lim_{n\to\infty}$. It is understood that
either of the two dimensions can index the sequence of problems under consideration, and that
$n/N \to \delta$. In the large system limit, we use the notation $\vec o_t(1)$ to represent a vector in $\mathbb{R}^t$ (with $t$
fixed) such that all of its coordinates converge to 0 almost surely as $N \to \infty$.

Finally, we will use $\mathbb{I}_{d\times d}$ to denote the $d \times d$ identity matrix (and drop the subscript when
dimensions should be clear from the context). Similarly, $0_{n\times m}$ is used to denote the $n \times m$ zero
matrix. The indicator function of property $\mathcal{A}$ is denoted by $\mathbb{I}(\mathcal{A})$ or $\mathbb{I}_{\mathcal{A}}$. The normal distribution
with mean $\mu$ and variance $v^2$ is $N(\mu, v^2)$.

3.6 Main technical Lemma 

We prove the following more general result. 

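Lemma 1. Let $\{A(N)\}_N$, $\{q_0(N)\}_N$, $\{x_0(N)\}_N$ and $\{w(N)\}_N$ be sequences as in Theorem 2, with
$n/N \to \delta \in (0,\infty)$ and let $\{\sigma_t, \tau_t\}_{t\ge 0}$ be defined uniquely by the recursion (3.5) with initialization
$\sigma_0^2 = \delta^{-1}\lim_{n\to\infty}\langle q^0, q^0\rangle$. Then the following hold for all $t \in \mathbb{N} \cup \{0\}$.

(a)

$$h^{t+1}|_{\mathfrak{S}_{t+1,t}} \stackrel{d}{=} \sum_{i=0}^{t-1}\alpha_i h^{i+1} + \tilde A^* m^t_{\perp} + \tilde Q_{t+1}\vec o_{t+1}(1), \tag{3.14}$$

$$b^{t}|_{\mathfrak{S}_{t,t}} \stackrel{d}{=} \sum_{i=0}^{t-1}\beta_i b^{i} + \tilde A q^t_{\perp} + \tilde M_{t}\vec o_{t}(1), \tag{3.15}$$

where $\tilde A$ is an independent copy of $A$ and the matrix $\tilde Q_t$ ($\tilde M_t$) is such that its columns form an
orthogonal basis for the column space of $Q_t$ ($M_t$) and $\tilde Q_t^*\tilde Q_t = N\,\mathbb{I}_{t\times t}$ ($\tilde M_t^*\tilde M_t = n\,\mathbb{I}_{t\times t}$).

(b) For all pseudo-Lipschitz functions $\phi_h, \phi_b : \mathbb{R}^{t+2} \to \mathbb{R}$ of order $k$

$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\phi_h(h_i^1, \dots, h_i^{t+1}, x_{0,i}) \stackrel{\rm a.s.}{=} \mathbb{E}\{\phi_h(\tau_0 Z_0, \dots, \tau_t Z_t, X_0)\}, \tag{3.16}$$

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\phi_b(b_i^0, \dots, b_i^{t}, w_i) \stackrel{\rm a.s.}{=} \mathbb{E}\{\phi_b(\sigma_0 \hat Z_0, \dots, \sigma_t \hat Z_t, W)\}, \tag{3.17}$$

where $(Z_0, \dots, Z_t)$ and $(\hat Z_0, \dots, \hat Z_t)$ are two zero-mean gaussian vectors independent of $X_0$, $W$,
with $Z_i, \hat Z_i \sim N(0,1)$.

(c) For all $0 \le r, s \le t$ the following equations hold and all limits exist, are bounded and have
degenerate distribution (i.e. they are constant random variables):

$$\lim_{N\to\infty}\langle h^{r+1}, h^{s+1}\rangle \stackrel{\rm a.s.}{=} \lim_{n\to\infty}\langle m^r, m^s\rangle, \tag{3.18}$$

$$\lim_{n\to\infty}\langle b^{r}, b^{s}\rangle \stackrel{\rm a.s.}{=} \frac{1}{\delta}\lim_{N\to\infty}\langle q^r, q^s\rangle. \tag{3.19}$$

(d) For all $0 \le r, s \le t$, and for any Lipschitz function $\varphi : \mathbb{R}^2 \to \mathbb{R}$, the following equations
hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant
random variables):

$$\lim_{N\to\infty}\langle h^{r+1}, \varphi(h^{s+1}, x_0)\rangle \stackrel{\rm a.s.}{=} \lim_{N\to\infty}\langle h^{r+1}, h^{s+1}\rangle\,\langle\varphi'(h^{s+1}, x_0)\rangle, \tag{3.20}$$

$$\lim_{n\to\infty}\langle b^{r}, \varphi(b^{s}, w)\rangle \stackrel{\rm a.s.}{=} \lim_{n\to\infty}\langle b^{r}, b^{s}\rangle\,\langle\varphi'(b^{s}, w)\rangle. \tag{3.21}$$

Here $\varphi'$ denotes derivative with respect to the first coordinate of $\varphi$.

(e) For $\ell = k - 1$, the following hold almost surely

$$\limsup_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}(h_i^{t+1})^{2\ell} < \infty, \tag{3.22}$$

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}(b_i^{t})^{2\ell} < \infty. \tag{3.23}$$

(f) For all $0 \le r \le t$:

$$\lim_{N\to\infty}\langle h^{r+1}, q^0\rangle \stackrel{\rm a.s.}{=} 0. \tag{3.24}$$

(g) For all $0 \le r \le t$ and $0 \le s \le t - 1$ the following limits exist, and there exist strictly positive
constants $\rho_r$ and $\varsigma_s$ (independent of $N$, $n$) such that almost surely

$$\lim_{N\to\infty}\langle q^r_{\perp}, q^r_{\perp}\rangle > \rho_r, \tag{3.25}$$

$$\lim_{n\to\infty}\langle m^s_{\perp}, m^s_{\perp}\rangle > \varsigma_s. \tag{3.26}$$

Note 2. Equations (3.20) and (3.21) have the form of Stein's lemma [Ste72] (cf. Lemma 4 in Section
3.7).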
3.6.1 Proof of Theorem 2

Assuming Lemma 1 is correct, Theorem 2 easily follows. To be more precise, Theorem 2 is obtained
by applying Lemma 1(b) to the functions $\phi_h(y_0, \dots, y_t, x_{0,i}) = \psi(y_t, x_{0,i})$ and $\phi_b(y_0, \dots, y_t, w_i) =
\psi(y_t, w_i)$.

The rest of Section 3 focuses on the proof of Lemma 1.



3.7 Useful probability facts 

Before embarking on the actual proof, it is convenient to summarize a few facts that will be used
repeatedly.

We will use the following strong law of large numbers (SLLN) for triangular arrays of independent
but not identically distributed random variables. The form stated below follows immediately from
[HT97, Theorem 2.1].

Theorem 3 (SLLN, [HT97]). Let $\{X_{n,i} : 1 \le i \le n, n \ge 1\}$ be a triangular array of random
variables with $(X_{n,1}, \dots, X_{n,n})$ mutually independent with mean equal to zero for each $n$ and
$n^{-1}\sum_{i=1}^{n}\mathbb{E}|X_{n,i}|^{2+\varrho} \le c\,n^{\varrho/2}$ for some $0 < \varrho < 1$, $c < \infty$. Then $\frac{1}{n}\sum_{i=1}^{n} X_{n,i} \to 0$ almost surely for
$n \to \infty$.

Note 3. Theorem 2.1 in [HT97] is stronger than what we state here. In the appendix we show how
Theorem 3 follows from it.

Next, we present an algebraic inequality that will be used in conjunction with Theorem 3. Its
proof is provided in Appendix C.

Lemma 2. Let $u_1, \dots, u_n$ be a sequence of non-negative numbers. Then for all $\epsilon > 0$ the following
holds

$$\sum_{i=1}^{n} u_i^{1+\epsilon} \le \Big(\sum_{i=1}^{n} u_i\Big)^{1+\epsilon}.$$

Next, we present a standard property of Gaussian matrices without proof.

Lemma 3. For any deterministic $u \in \mathbb{R}^N$ and $v \in \mathbb{R}^n$ with $\|u\| = \|v\| = 1$ and a gaussian matrix $\tilde A$
distributed as $A$ we have

(a) $v^*\tilde A u \stackrel{d}{=} Z/\sqrt{n}$ where $Z \sim N(0, 1)$.

(b) $\lim_{n\to\infty}\|\tilde A u\|^2 = 1$ almost surely.

(c) Consider, for $d \le n$, a $d$-dimensional subspace $W$ of $\mathbb{R}^n$, an orthogonal basis $w_1, \dots, w_d$ of
$W$ with $\|w_i\|^2 = n$ for $i = 1, \dots, d$, and the orthogonal projection $P_W$ onto $W$. Then for
$D = [w_1|\dots|w_d]$, we have $P_W\tilde A u \stackrel{d}{=} Dx$ with $x \in \mathbb{R}^d$ that satisfies: $\lim_{n\to\infty}\|x\| \stackrel{\rm a.s.}{=} 0$ (the limit
being taken with $d$ fixed). Note that $x$ is $\vec o_d(1)$ as well.

Lemma 4 (Stein's Lemma [Ste72]). For jointly gaussian random variables $Z_1, Z_2$ with zero mean,
and any function $\varphi : \mathbb{R} \to \mathbb{R}$ where $\mathbb{E}\{\varphi'(Z_1)\}$ and $\mathbb{E}\{Z_1\varphi(Z_2)\}$ exist, the following holds

$$\mathbb{E}\{Z_1\varphi(Z_2)\} = \mathrm{Cov}(Z_1, Z_2)\,\mathbb{E}\{\varphi'(Z_2)\}.$$
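Stein's identity is easy to probe by simulation. The sketch below is an illustration rather than part of the proof: it estimates both sides for $\varphi = \tanh$ and a pair of standard gaussians with correlation `rho`. The function name, the choice of $\varphi$, and the sample size are assumptions.

```python
import math
import random


def stein_check(rho=0.6, n_mc=200000, seed=0):
    """Monte Carlo estimates of E{Z1 phi(Z2)} and Cov(Z1, Z2) E{phi'(Z2)}
    for phi = tanh and unit-variance gaussians with correlation rho."""
    rng = random.Random(seed)
    lhs = 0.0
    rhs = 0.0
    for _ in range(n_mc):
        z2 = rng.gauss(0, 1)
        # z1 has unit variance and Cov(z1, z2) = rho
        z1 = rho * z2 + math.sqrt(1.0 - rho * rho) * rng.gauss(0, 1)
        lhs += z1 * math.tanh(z2)
        rhs += 1.0 - math.tanh(z2) ** 2   # derivative of tanh
    return lhs / n_mc, rho * rhs / n_mc
```

The two returned numbers should agree up to Monte Carlo error, which is of order $n_{\rm mc}^{-1/2}$.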

We will apply the following law of large numbers to the sequence $\{x_0(N), w(N)\}_N$. Its proof can
be found in Appendix D.1.

Lemma 5. Let $k \ge 2$ and consider a sequence of vectors $\{v(N)\}_{N\ge 0}$, whose empirical distribution
converges weakly to a probability measure $p_V$ on $\mathbb{R}$ with bounded $k$-th moment, and assume $\mathbb{E}_{\hat p_{v(N)}}(V^k) \to
\mathbb{E}_{p_V}(V^k)$ as $N \to \infty$. Then, for any pseudo-Lipschitz function $\psi : \mathbb{R} \to \mathbb{R}$ of order $k$:

$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\psi(v_i) \stackrel{\rm a.s.}{=} \mathbb{E}[\psi(V)]. \tag{3.27}$$

The next lemma is on weak convergence of Lipschitz functions and its proof is in Appendix D.2.

Lemma 6. Let $F : \mathbb{R}^2 \to \mathbb{R}$ be Lipschitz continuous and denote by $F'(x, y)$ its derivative with
respect to the first argument at $(x, y) \in \mathbb{R}^2$. Assume $(X_n, Y_n)$ is a sequence of random vectors in $\mathbb{R}^2$
converging in distribution to the random vector $(X, Y)$ as $n \to \infty$. Assume further that $X$ and $Y$
are independent and that the distribution of $X$ is absolutely continuous with respect to the Lebesgue
measure. Then

$$\lim_{n\to\infty}\mathbb{E}\{F'(X_n, Y_n)\} = \mathbb{E}\{F'(X, Y)\}. \tag{3.28}$$

It is useful to remember a standard formula for the conditional variance of gaussian random 
variables. 

Lemma 7. Let $(Z_1, \dots, Z_t)$ be a normal random vector with zero mean, and assume that the covariance
matrix of $(Z_1, \dots, Z_{t-1})$ (denoted by $C$) is invertible. Then

$$\mathrm{Var}(Z_t \,|\, Z_1, \dots, Z_{t-1}) = \mathbb{E}\{Z_t^2\} - u^* C^{-1} u,$$

where $u \in \mathbb{R}^{t-1}$ is given by $u_i = \mathbb{E}\{Z_t Z_i\}$.
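As a quick sanity check of the formula (a worked special case added for illustration), take $t = 2$, so that $C = \mathbb{E}\{Z_1^2\}$ and $u = \mathbb{E}\{Z_1 Z_2\}$ are scalars:

```latex
\mathrm{Var}(Z_2 \mid Z_1)
  = \mathbb{E}\{Z_2^2\} - \frac{\big(\mathbb{E}\{Z_1 Z_2\}\big)^2}{\mathbb{E}\{Z_1^2\}} .
```

For unit variances and correlation $\rho$ this gives $1 - \rho^2$, the familiar residual variance of a linear regression of $Z_2$ on $Z_1$.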

An immediate consequence is the following fact, proven in Appendix D.3.

Lemma 8. Let $Z_1, \dots, Z_t$ be a sequence of jointly gaussian random variables and let $c_1, \dots, c_t$ be
strictly positive constants such that for all $i = 1, \dots, t$: $\mathrm{Var}(Z_i \,|\, Z_1, \dots, Z_{i-1}) \ge c_i$. Further assume
$\mathbb{E}\{Z_i^2\} \le K$ for all $i$ and some constant $K$. Let $Y$ be a random variable in the same probability space.

Finally let $\ell : \mathbb{R}^2 \to \mathbb{R}$ be a Lipschitz function, with $z \mapsto \ell(z, Y)$ non-constant with positive
probability (with respect to $Y$).

Then there exists a positive constant $c_t'$ (depending on $c_1, \dots, c_t$, on $K$, on the random variable $Y$,
and on the function $\ell$) such that

$$\mathbb{E}\{[\ell(Z_t, Y)]^2\} - u^* C^{-1} u > c_t',$$

where $u \in \mathbb{R}^{t-1}$ is given by $u_i = \mathbb{E}\{\ell(Z_t, Y)\,\ell(Z_i, Y)\}$, and $C \in \mathbb{R}^{(t-1)\times(t-1)}$ satisfies $C_{ij} = \mathbb{E}\{\ell(Z_i, Y)\,\ell(Z_j, Y)\}$
for $1 \le i, j \le t - 1$.

3.7.1 Linear algebra facts 

It is also convenient to recall some linear algebra facts. The first one is proved in Appendix D.4.

Lemma 9. Let $v_1, \dots, v_t$ be a sequence of vectors in $\mathbb{R}^n$ such that for all $i = 1, \dots, t$:

$$\frac{1}{n}\|v_i - P_{i-1}(v_i)\|^2 \ge c$$

for a positive constant $c$, where $P_{i-1}$ is the orthogonal projector onto the span of $v_1, \dots, v_{i-1}$. Then
there is a constant $c'$ (depending only on $c$ and $t$), such that the matrix $C \in \mathbb{R}^{t\times t}$ with $C_{ij} = \langle v_i, v_j\rangle$
satisfies

$$\lambda_{\min}(C) \ge c'.$$

The second one is just a direct consequence of the fact that the mapping $S \mapsto \lambda_{\min}(S)$ is continuous
at any matrix $S$ that is invertible.

Lemma 10. Let $\{S_n\}_{n\ge 1}$ be a sequence of $t \times t$ matrices such that $\liminf_{n\to\infty}\lambda_{\min}(S_n) > c$ for
a positive constant $c$. Also assume that $\lim_{n\to\infty} S_n = S_\infty$ where the limit is element-wise. Then,

$$\lambda_{\min}(S_\infty) \ge c.$$

3.8 Conditional distributions 

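In order to calculate $b^t|_{\mathfrak{S}_{t,t}}$ and $h^{t+1}|_{\mathfrak{S}_{t+1,t}}$ we will characterize the conditional distributions $A|_{\mathfrak{S}_{t,t}}$
and $A|_{\mathfrak{S}_{t+1,t}}$.

Lemma 11. For $(t_1, t_2) = (t, t)$ or $(t_1, t_2) = (t + 1, t)$, the conditional distribution of the random
matrix $A$ given the $\sigma$-algebra $\mathfrak{S}_{t_1,t_2}$ satisfies

$$A|_{\mathfrak{S}_{t_1,t_2}} \stackrel{d}{=} E_{t_1,t_2} + \mathcal{P}_{t_1,t_2}(\tilde A). \tag{3.29}$$

Here $\tilde A \stackrel{d}{=} A$ is a random matrix independent of $\mathfrak{S}_{t_1,t_2}$ and $E_{t_1,t_2} = \mathbb{E}(A|\mathfrak{S}_{t_1,t_2})$ is given by

$$E_{t_1,t_2} = Y_{t_1}(Q_{t_1}^*Q_{t_1})^{-1}Q_{t_1}^* + M_{t_2}(M_{t_2}^*M_{t_2})^{-1}X_{t_2}^* - M_{t_2}(M_{t_2}^*M_{t_2})^{-1}M_{t_2}^*Y_{t_1}(Q_{t_1}^*Q_{t_1})^{-1}Q_{t_1}^*. \tag{3.30}$$

Further, $\mathcal{P}_{t_1,t_2}$ is the orthogonal projector onto the subspace $V_{t_1,t_2} = \{A \,|\, AQ_{t_1} = 0, A^*M_{t_2} = 0\}$, defined
by

$$\mathcal{P}_{t_1,t_2}(\tilde A) = P^{\perp}_{M_{t_2}}\tilde A P^{\perp}_{Q_{t_1}}.$$

Here $P^{\perp}_{M_{t_2}} = I - P_{M_{t_2}}$, $P^{\perp}_{Q_{t_1}} = I - P_{Q_{t_1}}$, and $P_{Q_{t_1}}$, $P_{M_{t_2}}$ are the orthogonal projectors onto the column spaces
of $Q_{t_1}$ and $M_{t_2}$ respectively.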

Recall the following well-known formula. 

Lemma 12. Let $z \in \mathbb{R}^n$ be a random vector with i.i.d. $N(0, v^2)$ entries and let $D \in \mathbb{R}^{m\times n}$ be a linear
operator with full row rank. Then for any constant vector $b \in \mathbb{R}^m$ the distribution of $z$ conditioned
on $Dz = b$ satisfies:

$$z|_{Dz=b} \stackrel{d}{=} D^*(DD^*)^{-1}b + P_{\{Dz=0\}}(\tilde z),$$

where $P_{\{Dz=0\}}$ is the orthogonal projection onto the subspace $\{Dz = 0\}$ and $\tilde z$ is a random vector of
i.i.d. $N(0, v^2)$ entries. Moreover, $D^*(DD^*)^{-1}b = \arg\min_z\{\|z\|^2 \,|\, Dz = b\}$.

Proof. The result is trivial if $D = [\mathbb{I}_{m\times m}\,|\,0_{m\times(n-m)}]$. For general $D$, it follows by invariance of the
gaussian distribution under rotations. Finally, using a least squares calculation, it is simple to see
that $D^*(DD^*)^{-1}b = \arg\min_z\{\|z\|^2 \,|\, Dz = b\}$. □
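The structure of Lemma 12 can be illustrated in the simplest case of a single linear constraint ($m = 1$, $D = d^*$). The sketch below draws one sample as the minimum-norm solution plus the projection of an independent gaussian vector; the function name and argument conventions are hypothetical, and the case $m > 1$ would need a matrix inverse that is omitted here.

```python
import random


def conditional_sample(d, b, v=1.0, rng=random):
    """One draw of z given <d, z> = b, for z with i.i.d. N(0, v^2) entries.

    Special case m = 1 of Lemma 12: D^*(D D^*)^{-1} b = b d / ||d||^2, and the
    projection onto {<d, z> = 0} removes the component along d.
    """
    norm_sq = sum(di * di for di in d)
    z_new = [rng.gauss(0, v) for _ in d]                       # independent copy of z
    coef = sum(di * zi for di, zi in zip(d, z_new)) / norm_sq
    projected = [zi - coef * di for di, zi in zip(d, z_new)]   # lies in {<d, z> = 0}
    mean = [b * di / norm_sq for di in d]                      # minimum-norm solution
    return [mi + pi for mi, pi in zip(mean, projected)]
```

By construction every draw satisfies the constraint exactly (up to rounding), while the randomness lives only in the subspace orthogonal to $d$.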

Lemma 11 follows from applying Lemma 12 to the operator $D$ that maps $A$ to $(AQ, M^*A)$. A
detailed proof of Lemma 11 appears in Section 3.8.1. Note that we can assume, without loss of
generality, $f$, $g$ to be non-constant as a function of their first argument. If this is the case, it is easy
to see that, for finite values of $t$, the matrices $M_t^*M_t$ and $Q_t^*Q_t$ are non-singular almost surely, and
hence the above expressions are well defined.

Lemma 13. The following holds

$$E_{t+1,t}^*\, m^t = X_t(M_t^*M_t)^{-1}M_t^* m^t_{\parallel} + Q_{t+1}(Q_{t+1}^*Q_{t+1})^{-1}Y_{t+1}^* m^t_{\perp}, \tag{3.31}$$

$$E_{t,t}\, q^t = Y_t(Q_t^*Q_t)^{-1}Q_t^* q^t_{\parallel} + M_t(M_t^*M_t)^{-1}X_t^* q^t_{\perp}. \tag{3.32}$$

On the other hand let $m^t_{\parallel} = \sum_{i=0}^{t-1}\alpha_i m^i = M_t\vec\alpha$. Then using $A^*M_t = X_t$, Eq. (3.29), and
$[\mathcal{P}_{t+1,t}(\tilde A)]^* m^t_{\parallel} = 0$ we have,

$$E_{t+1,t}^*\, m^t_{\parallel} = Q_{t+1}(Q_{t+1}^*Q_{t+1})^{-1}Y_{t+1}^* m^t_{\parallel} + X_t(M_t^*M_t)^{-1}M_t^* m^t_{\parallel} - Q_{t+1}(Q_{t+1}^*Q_{t+1})^{-1}Y_{t+1}^* m^t_{\parallel}
= X_t(M_t^*M_t)^{-1}M_t^* m^t_{\parallel}.$$

Similarly, writing $q^t = q^t_{\parallel} + q^t_{\perp}$, $q^t_{\parallel} = Q_t\vec\beta$, and using $X_t^*Q_t = M_t^*AQ_t = M_t^*Y_t$, $Q_t^* q^t_{\perp} = 0$ we obtain
(3.32). □

3.8.1 Proof of Lemma 11

Conditioning on $\mathfrak{S}_{t_1,t_2}$ is equivalent to conditioning on the linear constraints $AQ_{t_1} = Y_{t_1}$ and $A^*M_{t_2} =
X_{t_2}$. To simplify the notation, just in Section 3.8.1, we will drop all sub-indices $t_1$, $t_2$. The expression
(3.30) for the conditional expectation $E = \mathbb{E}\{A|\mathfrak{S}_{t_1,t_2}\}$ follows from Lemma 12 and the following
calculation for

$$E = \arg\min_{A}\big\{\|A\|_F^2 \;\big|\; AQ = Y,\; A^*M = X\big\},$$

where $\|A\|_F$ denotes the Frobenius norm of matrix $A$. We use the method of Lagrange multipliers to obtain
this minimum. Consider the Lagrangian

$$\mathcal{L}(A, \Theta, \Gamma) = \|A\|_F^2 + (\Theta, (Y - AQ)) + (\Gamma, (X - A^*M)),$$

with $\Theta \in \mathbb{R}^{n\times t_1}$, $\Gamma \in \mathbb{R}^{N\times t_2}$ and $(A, B) = \mathrm{Tr}(AB^*)$ the usual scalar product among matrices.
Imposing the stationarity conditions yields

$$2A = \Theta Q^* + M\Gamma^*. \tag{3.33}$$

Equation (3.33) does not have a unique solution for the parameters $\Theta$ and $\Gamma$. In fact if $\Theta_0$, $\Gamma_0$ are
a solution then for any $t_2 \times t_1$ matrix $R$ the new parameters $\Theta_R = \Theta_0 + MR$ and $\Gamma_R = \Gamma_0 - QR^*$
satisfy $\Theta_R Q^* + M\Gamma_R^* = \Theta_0 Q^* + M\Gamma_0^* = 2A$. In particular for $R_1 = \Gamma_0^*Q(Q^*Q)^{-1}$ we have $Q^*\Gamma_{R_1} = 0$.
Multiplying (3.33) by $Q$ from the right (using $\Theta_{R_1}$, $\Gamma_{R_1}$) we have $2Y = \Theta_{R_1}Q^*Q$ or $\Theta_{R_1} = 2Y(Q^*Q)^{-1}$.
Now multiplying (3.33) by $M^*$ from the left we obtain $2X^* = 2M^*Y(Q^*Q)^{-1}Q^* + M^*M\Gamma_{R_1}^*$ which leads
to $\Gamma_{R_1}^* = 2(M^*M)^{-1}[X^* - M^*Y(Q^*Q)^{-1}Q^*]$. From these we see that $E = \mathbb{E}(A|\mathfrak{S}_{t_1,t_2})$ satisfies:

$$E = Y(Q^*Q)^{-1}Q^* + M(M^*M)^{-1}X^* - M(M^*M)^{-1}M^*Y(Q^*Q)^{-1}Q^*.$$

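Now we are left with the task of proving that $\mathcal{P}_{t_1,t_2}(\tilde A) = P_M^{\perp}\tilde A P_Q^{\perp}$. We need to show that the linear
operator $\mathcal{F} : \tilde A \mapsto P_M^{\perp}\tilde A P_Q^{\perp}$ satisfies

(a) $\mathcal{F}\circ\mathcal{F} = \mathcal{F}$.

(b) $\mathcal{F}(\tilde A) \in V = \{A \,|\, AQ_{t_1} = 0, A^*M_{t_2} = 0\}$.

(c) $\mathcal{F}(A) = A$ for $A \in V$.

(d) $\mathcal{F}$ is symmetric. That is, for all matrices $A$, $B$: $(\mathcal{F}(A), B) = (A, \mathcal{F}(B))$.

Now we check (a)-(d):

(a) is correct since

$$\mathcal{F}\circ\mathcal{F}(A) = P_M^{\perp}P_M^{\perp}A P_Q^{\perp}P_Q^{\perp} = P_M^{\perp}A P_Q^{\perp}.$$

(b) is correct since by definition of $\mathcal{F}$, $\mathcal{F}(A)Q = P_M^{\perp}A P_Q^{\perp}Q = 0$ and similarly $\mathcal{F}(A)^*M = 0$.

(c) follows because

$$\mathcal{F}(A) = A - P_M A - A P_Q + P_M A P_Q,$$

and each of the last three terms vanishes either because $AQ = 0$ or because $A^*M = 0$.

(d) is correct because

$$(\mathcal{F}(A), B) = \mathrm{Tr}(P_M^{\perp}A P_Q^{\perp}B^*) = \mathrm{Tr}(A P_Q^{\perp}B^*P_M^{\perp}) = (A, \mathcal{F}(B)).$$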
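Properties (a) and (b) can be verified numerically in the rank-one case, where $M$ and $Q$ each consist of a single column. This is a small illustrative sketch; the helper names `perp_projector`, `matmul` and `F` are assumptions, and dense list-of-lists matrices are used only to keep the example dependency-free.

```python
def perp_projector(u):
    """Dense matrix I - u u^T / ||u||^2 (projector onto the complement of span{u})."""
    s = sum(x * x for x in u)
    k = len(u)
    return [[(1.0 if i == j else 0.0) - u[i] * u[j] / s for j in range(k)]
            for i in range(k)]


def matmul(X, Y):
    return [[sum(X[i][l] * Y[l][j] for l in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def F(A, m, q):
    """A -> P_M^perp A P_Q^perp with M = [m] (length n) and Q = [q] (length N)."""
    return matmul(matmul(perp_projector(m), A), perp_projector(q))
```

Applying `F` twice should return the same matrix as applying it once, and the result should annihilate `q` on the right and `m` on the left, mirroring (a) and (b).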

3.9 Proof of Lemma 1

The proof is by induction on $t$. Let $\mathcal{H}_{t+1}$ be the property that (3.14), (3.16), (3.18), (3.20), (3.22),
(3.24), and (3.25) hold. Similarly, let $\mathcal{B}_t$ be the property that (3.15), (3.17), (3.19), (3.21), (3.23)
and (3.26) hold. The inductive proof consists of the following four main steps.

1. $\mathcal{B}_0$ holds.

2. $\mathcal{H}_1$ holds.

3. If $\mathcal{B}_r$, $\mathcal{H}_s$ hold for all $r < t$ and $s \le t$ then $\mathcal{B}_t$ holds.

4. If $\mathcal{B}_r$, $\mathcal{H}_s$ hold for all $r \le t$ and $s \le t$ then $\mathcal{H}_{t+1}$ holds.

For each of these steps we will have to prove several properties that we will denote by (a), (b), (c),
(d), (e) and (g) according to their appearance in Lemma 1. For $\mathcal{H}$ we also need to prove a property
(f).

It is immediate to check that our claims become trivial if $x \mapsto f_t(x, X_0)$ is constant (i.e. independent
of $x$) almost surely (with respect to $X_0 \sim p_{X_0}$), or if $x \mapsto g_t(x, W)$ is constant almost surely
(with respect to $W \sim p_W$). We will therefore assume that neither of these degenerate cases holds.

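3.9.1 Step 1: $\mathcal{B}_0$

Note that $b^0 = Aq^0$.

(a) $\mathfrak{S}_{0,0}$ is generated by $x_0$, $q^0$ and $w$. Also $q^0 = q^0_{\perp}$ since $Q_0$ is an empty matrix. Hence
$b^0|_{\mathfrak{S}_{0,0}} = Aq^0_{\perp}$.

(b) Let $\phi_b : \mathbb{R}^2 \to \mathbb{R}$ be a pseudo-Lipschitz function of order $k$. Hence, $|\phi_b(x)| \le L(1 + \|x\|^k)$. Given
$q^0$, $w$, the random variable $\sum_{i=1}^{n}\phi_b([Aq^0]_i, w_i)/n$ is a sum of independent random variables.
By Lemma 3(a), $[Aq^0]_i \stackrel{d}{=} Z\|q^0\|/\sqrt{n}$ for $Z \sim N(0, 1)$. Hence, using

$$\lim_{N\to\infty}\langle q^0, q^0\rangle = \delta\sigma_0^2 < \infty,$$

for all $p \ge 2$ there exists a constant $c_p$ such that $\mathbb{E}\{|[Aq^0]_i|^p\} = (\|q^0\|^2/n)^{p/2}\,\mathbb{E}|Z|^p < c_p$. Therefore,
in order to check the conditions of Theorem 3 for $X_{n,i} \equiv \phi_b(b_i^0, w_i) - \mathbb{E}_A\{\phi_b(b_i^0, w_i)\}$, for a $\varrho$
in the interval $(0, 1)$,

$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}|X_{n,i}|^{2+\varrho} = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_A\big|\phi_b([Aq^0]_i, w_i) - \mathbb{E}_{\tilde A}\phi_b([\tilde Aq^0]_i, w_i)\big|^{2+\varrho}$$

$$\le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{A,\tilde A}\big|\phi_b([Aq^0]_i, w_i) - \phi_b([\tilde Aq^0]_i, w_i)\big|^{2+\varrho}$$

$$\le \frac{L^{2+\varrho}}{n}\sum_{i=1}^{n}\mathbb{E}_{A,\tilde A}\Big\{\big|[Aq^0]_i - [\tilde Aq^0]_i\big|\big(1 + |w_i|^{k-1} + |[Aq^0]_i|^{k-1} + |[\tilde Aq^0]_i|^{k-1}\big)\Big\}^{2+\varrho}$$

$$\le c' + \frac{c'}{n}\sum_{i=1}^{n}|w_i|^{(k-1)(2+\varrho)}, \tag{3.34}$$

where $\tilde A$ is an independent copy of $A$ and $c'$ is a constant. Now using Lemma 2 for $u_i = |w_i|^{2(k-1)}$ and $\epsilon = \varrho/2$ we
have

$$\sum_{i=1}^{n}|w_i|^{(k-1)(2+\varrho)} \le \Big(\sum_{i=1}^{n}|w_i|^{2(k-1)}\Big)^{1+\varrho/2},$$

which combined with Eq. (3.34) leads to

$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}|X_{n,i}|^{2+\varrho} \le c' + c'\,n^{\varrho/2}\Big(\frac{1}{n}\sum_{i=1}^{n}|w_i|^{2(k-1)}\Big)^{1+\varrho/2} \le c''\,n^{\varrho/2}.$$

The last inequality uses the assumption on the empirical moments of $w$. Therefore, we can apply
Theorem 3 to get

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\big[\phi_b(b_i^0, w_i) - \mathbb{E}_A\{\phi_b(b_i^0, w_i)\}\big] \stackrel{\rm a.s.}{=} 0.$$

Hence, using Lemma 5 for $v = w$ and for $\psi(w_i) = \mathbb{E}_Z\{\phi_b(\|q^0\|Z/\sqrt{n}, w_i)\}$ we get

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_A[\phi_b(b_i^0, w_i)] \stackrel{\rm a.s.}{=} \mathbb{E}\{\phi_b(\sigma_0 Z, W)\}.$$

Note that $\psi$ belongs to PL($k$) since $\phi_b$ belongs to PL($k$).

(c) Using Lemma 3, conditioned on $q^0$,

$$\lim_{n\to\infty}\langle b^0, b^0\rangle = \lim_{n\to\infty}\frac{\|Aq^0\|^2}{n} \stackrel{\rm a.s.}{=} \lim_{N\to\infty}\frac{\langle q^0, q^0\rangle}{\delta} = \sigma_0^2.$$

(d) Using $\mathcal{B}_0$(b), and $\phi_b(x, w_i) = x\varphi(x, w_i)$ we obtain $\lim_{n\to\infty}\langle b^0, \varphi(b^0, w)\rangle \stackrel{\rm a.s.}{=} \mathbb{E}\{\sigma_0\hat Z\varphi(\sigma_0\hat Z, W)\}$,
which is equal to $\sigma_0^2\,\mathbb{E}\{\varphi'(\sigma_0\hat Z, W)\}$ using Lemma 4. Note that $x\varphi$ belongs to PL($k$).

By part (b), the empirical distribution of $(b^0, w)$ (i.e. the probability distribution on $\mathbb{R}^2$ that
puts mass $1/n$ on each point $(b_i^0, w_i)$, $i \in [n]$) converges weakly to the distribution of $(\sigma_0\hat Z, W)$.
Using Lemma 6 we get $\lim_{n\to\infty}\langle\varphi'(b^0, w)\rangle \stackrel{\rm a.s.}{=} \mathbb{E}\{\varphi'(\sigma_0\hat Z, W)\}$.

(e) Similar to (b), conditioning on $q^0$, the term $\sum_{i=1}^{n}([Aq^0]_i)^{2\ell}/n$ is a sum of independent random
variables (namely, gaussians to the power $2\ell$) and $\mathbb{E}\{|[Aq^0]_i|^p\} = (\|q^0\|^2/n)^{p/2}\,\mathbb{E}\{|Z|^p\} < c$ for a
constant $c$. Therefore, by Theorem 3, we get

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\big[([Aq^0]_i)^{2\ell} - \mathbb{E}_A\{([Aq^0]_i)^{2\ell}\}\big] \stackrel{\rm a.s.}{=} 0.$$

But, $\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_A\{([Aq^0]_i)^{2\ell}\} = (\|q^0\|^2/n)^{\ell}\,\mathbb{E}_Z(Z^{2\ell}) < \infty$.

(g) This case is trivial since there is no $m^s$ with $s \le t - 1 = -1$.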

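3.9.2 Step 2: $\mathcal{H}_1$

Note that $h^1 = A^*m^0 - \xi_0 q^0$.

(a) $\mathfrak{S}_{1,0}$ is generated by $x_0$, $q^0$, $w$, $b^0$ and $m^0$. Also $m^0 = m^0_{\perp}$ since $M_0$ is an empty matrix.
Applying Lemma 11 we have

$$A|_{\mathfrak{S}_{1,0}} \stackrel{d}{=} b^0\|q^0\|^{-2}(q^0)^* + \tilde A P^{\perp}_{q^0}.$$

Hence,

$$h^1|_{\mathfrak{S}_{1,0}} \stackrel{d}{=} P^{\perp}_{q^0}\tilde A^* m^0 + \frac{(b^0)^* m^0}{\|q^0\|^2}\,q^0 - \xi_0 q^0.$$

But using $\mathcal{B}_0$(d) for $\varphi = g_0$

$$\lim_{n\to\infty}\langle b^0, m^0\rangle = \lim_{n\to\infty}\langle b^0, g_0(b^0, w)\rangle \stackrel{\rm a.s.}{=} \lim_{n\to\infty}\langle b^0, b^0\rangle\langle g_0'(b^0, w)\rangle = \lim_{N\to\infty}\frac{\langle q^0, q^0\rangle}{\delta}\,\xi_0.$$

Therefore,

$$h^1|_{\mathfrak{S}_{1,0}} \stackrel{d}{=} P^{\perp}_{q^0}\tilde A^* m^0 + \vec o_1(1)\,q^0.$$

Also $\mathcal{B}_0$(b), applied to the function $\phi_b(x, w) = g_0(x, w)^2$ gives

$$\lim_{n\to\infty}\langle m^0, m^0\rangle \stackrel{\rm a.s.}{=} \mathbb{E}[g_0(\sigma_0 Z, W)^2] = \tau_0^2 < \infty. \tag{3.35}$$

Thus,

$$P^{\perp}_{q^0}\tilde A^* m^0 = \tilde A^* m^0 - P_{q^0}\tilde A^* m^0 = \tilde A^* m^0 + \vec o_1(1)\,q^0,$$

where the last estimate follows from Lemma 3(c) and (3.35). Finally,

$$h^1|_{\mathfrak{S}_{1,0}} \stackrel{d}{=} \tilde A^* m^0 + \vec o_1(1)\,q^0. \tag{3.36}$$

(c) Using (3.36), (3.35), and Lemma 3, we get

$$\lim_{N\to\infty}\langle h^1, h^1\rangle|_{\mathfrak{S}_{1,0}} \stackrel{d}{=} \lim_{N\to\infty}\frac{\|\tilde A^* m^0\|^2}{N} \stackrel{\rm a.s.}{=} \lim_{n\to\infty}\langle m^0, m^0\rangle \stackrel{\rm a.s.}{=} \tau_0^2.$$

(e) First note that, conditioning on $\mathfrak{S}_{1,0}$,

$$\frac{1}{N}\sum_{i=1}^{N}(h_i^1)^{2\ell} \stackrel{d}{=} \frac{1}{N}\sum_{i=1}^{N}\big([\tilde A^* m^0]_i + o_1(1)\,q_i^0\big)^{2\ell} \le \frac{2^{2\ell}}{N}\Big(\sum_{i=1}^{N}([\tilde A^* m^0]_i)^{2\ell} + o_1(1)\sum_{i=1}^{N}(q_i^0)^{2\ell}\Big).$$

By assumption, $\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}(q_i^0)^{2\ell} < \infty$ and finiteness of $\frac{1}{N}\sum_{i=1}^{N}([\tilde A^* m^0]_i)^{2\ell}$ can be
established similarly to $\mathcal{B}_0$(e) for the sum of functions of independent gaussians $\sum_{i=1}^{N}([\tilde A^* m^0]_i)^{2\ell}/N$.

(f) Using (3.36) and Lemma 3(a) we have almost surely

$$\lim_{N\to\infty}\langle h^1, q^0\rangle \stackrel{d}{=} \lim_{N\to\infty}\frac{Z\,\|m^0\|\,\|q^0\|}{N\sqrt{n}} \stackrel{\rm a.s.}{=} \lim_{N\to\infty}\frac{Z}{\sqrt{N}}\sqrt{\langle m^0, m^0\rangle\langle q^0, q^0\rangle} \stackrel{\rm a.s.}{=} 0.$$

(b) This proof uses again Eq. (3.36) and is very similar to the proof of $\mathcal{B}_0$(b). First we need to
control the error term $\vec o_1(1)\,q^0$. In other words we need to show

$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\Big[\phi_h\big([\tilde A^* m^0]_i + o_1(1)\,q_i^0, x_{0,i}\big) - \phi_h\big([\tilde A^* m^0]_i, x_{0,i}\big)\Big] \stackrel{\rm a.s.}{=} 0.$$

To simplify the notation let $a_i = ([\tilde A^* m^0]_i + o_1(1)\,q_i^0, x_{0,i})$ and $c_i = ([\tilde A^* m^0]_i, x_{0,i})$. Now, using
the pseudo-Lipschitz property of $\phi_h$:

$$|\phi_h(a_i) - \phi_h(c_i)| \le L\{1 + \max(\|a_i\|^{k-1}, \|c_i\|^{k-1})\}\,|q_i^0|\,o_1(1).$$

Using the Cauchy-Schwarz inequality, for a constant $L'$ depending only on $L$,

$$\frac{1}{N}\sum_{i=1}^{N}|\phi_h(a_i) - \phi_h(c_i)| \le L'\Big[1 + \max\Big(\frac{1}{N}\sum_{i=1}^{N}\|a_i\|^{2k-2}, \frac{1}{N}\sum_{i=1}^{N}\|c_i\|^{2k-2}\Big)\Big]^{1/2}\langle q^0, q^0\rangle^{1/2}\,o_1(1).$$

Hence, we only need to show $\frac{1}{N}\sum_{i=1}^{N}\|a_i\|^{2k-2} < \infty$ and $\frac{1}{N}\sum_{i=1}^{N}\|c_i\|^{2k-2} < \infty$ as $N \to \infty$.
But

$$\frac{1}{N}\sum_{i=1}^{N}\|a_i\|^{2\ell} = O\Big(\frac{1}{N}\sum_{i=1}^{N}(h_i^1)^{2\ell} + \frac{1}{N}\sum_{i=1}^{N}(x_{0,i})^{2\ell}\Big),$$

which is bounded using part (e) and the original assumption on $x_0$. Similarly, using $\frac{1}{N}\sum_{i=1}^{N}|q_i^0|^{2\ell} <
\infty$, we obtain $\frac{1}{N}\sum_{i=1}^{N}\|c_i\|^{2k-2} < \infty$.

Thus, from here we consider $h^1|_{\mathfrak{S}_{1,0}} \stackrel{d}{=} \tilde A^* m^0$ whose components are distributed as $\|m^0\|Z/\sqrt{n}$
for $Z$ a standard normal random variable, and will follow the steps taken in $\mathcal{B}_0$(b). Conditionally
on $\mathfrak{S}_{1,0}$, we can apply Theorem 3 to get

$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\Big[\phi_h\big([\tilde A^* m^0]_i, x_{0,i}\big) - \mathbb{E}_{\tilde A}\,\phi_h\big([\tilde A^* m^0]_i, x_{0,i}\big)\Big] \stackrel{\rm a.s.}{=} 0.$$

Note that a similar inequality to (3.34) can be obtained here as well and then the condition of
Theorem 3 follows using Lemma 2 and the assumed bound on empirical moments of $x_0$. Then,
using Lemma 5 for $v = x_0$ and $\psi(x_{0,i}) = \mathbb{E}_Z\{\phi_h(\|m^0\|Z/\sqrt{n}, x_{0,i})\}$, we obtain

$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_Z\Big\{\phi_h\Big(\frac{\|m^0\|}{\sqrt{n}}Z, x_{0,i}\Big)\Big\} \stackrel{\rm a.s.}{=} \mathbb{E}\{\phi_h(\tau_0 Z, X_0)\}.$$

The last equality used $\mathcal{B}_0$(c).

(d) Using $\mathcal{H}_1$(b) for $\phi_h(x, x_{0,i}) = x\varphi(x, x_{0,i})$ we obtain $\lim_{N\to\infty}\langle h^1, \varphi(h^1, x_0)\rangle \stackrel{\rm a.s.}{=} \mathbb{E}\{\tau_0 Z\varphi(\tau_0 Z, X_0)\}$,
which is equal to $\tau_0^2\,\mathbb{E}\{\varphi'(\tau_0 Z, X_0)\}$ using Lemma 4. On the other hand, in the proof of (b) we
showed that $\lim_{N\to\infty}\langle h^1, h^1\rangle \stackrel{\rm a.s.}{=} \tau_0^2$.

By part (b) the empirical distribution of $(h^1, x_0)$ (i.e. the probability distribution on $\mathbb{R}^2$ that
puts mass $1/N$ on each point $(h_i^1, x_{0,i})$, $i \in [N]$) converges weakly to $(\tau_0 Z, X_0)$. By applying
Lemma 6 to the Lipschitz function $\varphi$, we get $\lim_{N\to\infty}\langle\varphi'(h^1, x_0)\rangle \stackrel{\rm a.s.}{=} \mathbb{E}\{\varphi'(\tau_0 Z, X_0)\}$.

(g) Since $t = 0$, and $q^0_{\perp} = q^0$, the result follows from (3.4) and the fact that $\sigma_0 > 0$.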



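3.9.3 Step 3: $\mathcal{B}_t$

This part is analogous to step 1 albeit more complex.
First we prove (g).

(g) Note that using induction hypothesis $\mathcal{B}_{t-1}$(b) for $\phi_b(b_i^r, b_i^s, w_i) = g_r(b_i^r, w_i)\,g_s(b_i^s, w_i)$, $0 \le r, s \le
t - 1$, we have almost surely

$$\lim_{n\to\infty}\langle m^r, m^s\rangle = \mathbb{E}\{g_r(\sigma_r\hat Z_r, W)\,g_s(\sigma_s\hat Z_s, W)\}. \tag{3.37}$$

On the other hand

$$\langle m^{t-1}_{\perp}, m^{t-1}_{\perp}\rangle = \langle m^{t-1}, m^{t-1}\rangle - \frac{(m^{t-1})^* M_{t-1}}{n}\Big[\frac{M_{t-1}^* M_{t-1}}{n}\Big]^{-1}\frac{M_{t-1}^* m^{t-1}}{n}. \tag{3.38}$$

But using the induction hypothesis, we have $\lim_{n\to\infty}\langle m^r_{\perp}, m^r_{\perp}\rangle > \varsigma_r > 0$ for all $r < t - 1$. So using
Lemma 9, for large enough $n$ the smallest eigenvalue of the matrix $M_{t-1}^* M_{t-1}/n$ is larger than a
positive constant $c'$ that is independent of $n$. Hence, by Lemma 10 its inverse converges to an
invertible limit. Thus, Eqs. (3.37) and (3.38) lead to

$$\lim_{n\to\infty}\langle m^{t-1}_{\perp}, m^{t-1}_{\perp}\rangle \stackrel{\rm a.s.}{=} \mathbb{E}\{[g_{t-1}(\sigma_{t-1}\hat Z_{t-1}, W)]^2\} - u^* C^{-1} u \tag{3.39}$$

with $u \in \mathbb{R}^{t-1}$ and $C \in \mathbb{R}^{(t-1)\times(t-1)}$ such that for $1 \le r, s \le t - 1$:

$$u_r = \mathbb{E}\{g_{r-1}(\sigma_{r-1}\hat Z_{r-1}, W)\,g_{t-1}(\sigma_{t-1}\hat Z_{t-1}, W)\}, \qquad C_{rs} = \mathbb{E}\{g_{r-1}(\sigma_{r-1}\hat Z_{r-1}, W)\,g_{s-1}(\sigma_{s-1}\hat Z_{s-1}, W)\}.$$

Now the result follows from Lemma 8 provided that we show for gaussian random variables
$\sigma_0\hat Z_0, \dots, \sigma_{t-1}\hat Z_{t-1}$, all conditional variances $\mathrm{Var}[\sigma_r\hat Z_r \,|\, \sigma_0\hat Z_0, \dots, \sigma_{r-1}\hat Z_{r-1}]$ are strictly positive
for $r = 0, \dots, t - 1$. To prove the latter, first using the induction hypothesis $\mathcal{B}_{t-1}$(b), we have
for all $0 \le r \le t - 1$

$$\lim_{n\to\infty}\langle b^r_{\perp}, b^r_{\perp}\rangle = \lim_{n\to\infty}\Big(\langle b^r, b^r\rangle - \frac{(b^r)^* B_r}{n}\Big[\frac{B_r^* B_r}{n}\Big]^{-1}\frac{B_r^* b^r}{n}\Big) \stackrel{\rm a.s.}{=} \mathrm{Var}[\sigma_r\hat Z_r \,|\, \sigma_0\hat Z_0, \dots, \sigma_{r-1}\hat Z_{r-1}].$$

As above, we used the fact that for large enough $n$ the matrix $B_r^* B_r/n$ has a smallest
eigenvalue greater than a positive constant to obtain the limit of its inverse. On the other hand
using induction hypothesis $\mathcal{B}_r$(c) we have almost surely

$$\lim_{n\to\infty}\langle b^r_{\perp}, b^r_{\perp}\rangle = \lim_{n\to\infty}\Big(\langle b^r, b^r\rangle - \frac{(b^r)^* B_r}{n}\Big[\frac{B_r^* B_r}{n}\Big]^{-1}\frac{B_r^* b^r}{n}\Big)$$

$$= \frac{1}{\delta}\lim_{N\to\infty}\Big(\langle q^r, q^r\rangle - \frac{(q^r)^* Q_r}{N}\Big[\frac{Q_r^* Q_r}{N}\Big]^{-1}\frac{Q_r^* q^r}{N}\Big) = \frac{1}{\delta}\lim_{N\to\infty}\langle q^r_{\perp}, q^r_{\perp}\rangle. \tag{3.40}$$

And now by induction hypothesis $\mathcal{H}_r$(g) we have $\lim_{N\to\infty}\langle q^r_{\perp}, q^r_{\perp}\rangle > \rho_r$. Hence the result
follows.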


Corollary 2. The vectors

$$\vec\alpha = (\alpha_0, \dots, \alpha_{t-1}) = \Big[\frac{M_t^* M_t}{n}\Big]^{-1}\frac{M_t^* m^t}{n}, \qquad \vec\beta = (\beta_0, \dots, \beta_{t-1}) = \Big[\frac{Q_t^* Q_t}{N}\Big]^{-1}\frac{Q_t^* q^t}{N}$$

have finite limits as $n$ and $N$ converge to $\infty$.

Proof. We can apply Lemma 9 to obtain that for large enough $n$ the smallest eigenvalue of
$M_t^* M_t/n$ is larger than a positive constant $c'$. Hence by Lemma 10 its inverse has a finite
limit. Similarly, we can apply induction hypothesis $\mathcal{H}_t$(g) and Lemmas 9 and 10 to the matrix
$Q_t^* Q_t/N$. □

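(a) Recall the definition of $Y_t$ and $X_t$ from Section 3.5:

$$X_t = H_t + Q_t\Xi_t, \qquad Y_t = B_t + [0\,|\,M_{t-1}]\Lambda_t, \tag{3.41}$$

where $\Xi_t = \mathrm{diag}(\xi_0, \dots, \xi_{t-1})$, $H_t = [h^1|\cdots|h^t]$, $B_t = [b^0|\cdots|b^{t-1}]$, and $\Lambda_t = \mathrm{diag}(\lambda_0, \dots, \lambda_{t-1})$.

Lemma 14. The following holds

(a) $h^{t+1}|_{\mathfrak{S}_{t+1,t}} \stackrel{d}{=} H_t(M_t^*M_t)^{-1}M_t^* m^t_{\parallel} + P^{\perp}_{Q_{t+1}}\tilde A^* P^{\perp}_{M_t} m^t + Q_t\vec o_t(1)$.

(b) $b^t|_{\mathfrak{S}_{t,t}} \stackrel{d}{=} B_t(Q_t^*Q_t)^{-1}Q_t^* q^t_{\parallel} + P^{\perp}_{M_t}\tilde A P^{\perp}_{Q_t} q^t + M_t\vec o_t(1)$.

Proof. In light of Lemmas 11 and 13 we have

$$h^{t+1}|_{\mathfrak{S}_{t+1,t}} \stackrel{d}{=} X_t(M_t^*M_t)^{-1}M_t^* m^t_{\parallel} + Q_{t+1}(Q_{t+1}^*Q_{t+1})^{-1}Y_{t+1}^* m^t_{\perp} + P^{\perp}_{Q_{t+1}}\tilde A^* P^{\perp}_{M_t} m^t - \xi_t q^t,$$

$$b^{t}|_{\mathfrak{S}_{t,t}} \stackrel{d}{=} Y_t(Q_t^*Q_t)^{-1}Q_t^* q^t_{\parallel} + M_t(M_t^*M_t)^{-1}X_t^* q^t_{\perp} + P^{\perp}_{M_t}\tilde A P^{\perp}_{Q_t} q^t - \lambda_t m^{t-1}.$$

Now using (3.41), we only need to show

$$Q_t\Xi_t(M_t^*M_t)^{-1}M_t^* m^t_{\parallel} + Q_{t+1}(Q_{t+1}^*Q_{t+1})^{-1}Y_{t+1}^* m^t_{\perp} - \xi_t q^t = Q_t\vec o_t(1),$$

$$[0\,|\,M_{t-1}]\Lambda_t(Q_t^*Q_t)^{-1}Q_t^* q^t_{\parallel} + M_t(M_t^*M_t)^{-1}X_t^* q^t_{\perp} - \lambda_t m^{t-1} = M_t\vec o_t(1).$$

Recall that $m^t_{\parallel} = M_t\vec\alpha$ and $q^t_{\parallel} = Q_t\vec\beta$. On the other hand $Y_{t+1}^* m^t_{\perp} = B_{t+1}^* m^t_{\perp}$ because $M_t^* m^t_{\perp} =
0$. Similarly, $X_t^* q^t_{\perp} = H_t^* q^t_{\perp}$. Hence we need to show

$$Q_t\Xi_t\vec\alpha + Q_{t+1}(Q_{t+1}^*Q_{t+1})^{-1}B_{t+1}^* m^t_{\perp} - \xi_t q^t = Q_t\vec o_t(1), \tag{3.42}$$

$$[0\,|\,M_{t-1}]\Lambda_t\vec\beta + M_t(M_t^*M_t)^{-1}H_t^* q^t_{\perp} - \lambda_t m^{t-1} = M_t\vec o_t(1). \tag{3.43}$$

Here is our strategy to prove (3.43) (the proof of (3.42) is similar). The left hand side is a linear
combination of the vectors $m^0, \dots, m^{t-1}$. For any $\ell = 1, \dots, t$ we will prove that the coefficient of
$m^{\ell-1} \in \mathbb{R}^n$ converges to 0. This coefficient in the left hand side is equal to

$$\big[(M_t^*M_t)^{-1}H_t^* q^t_{\perp}\big]_\ell - \lambda_\ell(-\beta_\ell)^{\mathbb{I}_{\ell\ne t}} = \sum_{r=1}^{t}\Big[\Big(\frac{M_t^*M_t}{n}\Big)^{-1}\Big]_{\ell,r}\frac{(h^r)^* q^t_{\perp}}{n} - \lambda_\ell(-\beta_\ell)^{\mathbb{I}_{\ell\ne t}}.$$

To simplify the notation denote the matrix $M_t^*M_t/n$ by $G$. Therefore,

$$\lim_{N\to\infty}\text{Coefficient of } m^{\ell-1} = \lim_{N\to\infty}\Big\{\sum_{r=1}^{t}(G^{-1})_{\ell,r}\,\frac{1}{\delta}\Big\langle h^r, q^t - \sum_{s=0}^{t-1}\beta_s q^s\Big\rangle - \lambda_\ell(-\beta_\ell)^{\mathbb{I}_{\ell\ne t}}\Big\}.$$

But using the induction hypothesis $\mathcal{H}_t$(d) for $\varphi = f_1, \dots, f_t$, and $\mathcal{H}_t$(f), the term $\frac{1}{\delta}\langle h^r, q^t -
\sum_{s=0}^{t-1}\beta_s q^s\rangle$ is almost surely equal to the limit of $\langle h^r, h^t\rangle\lambda_t - \sum_{s=0}^{t-1}\beta_s\langle h^r, h^s\rangle\lambda_s$. This can be
modified, using the induction hypothesis $\mathcal{H}_t$(c), to $\langle m^{r-1}, m^{t-1}\rangle\lambda_t - \sum_{s=0}^{t-1}\beta_s\langle m^{r-1}, m^{s-1}\rangle\lambda_s$
almost surely, which can be written as $G_{r,t}\lambda_t - \sum_{s=0}^{t-1}\beta_s G_{r,s}\lambda_s$. Hence,

$$\lim_{N\to\infty}\text{Coefficient of } m^{\ell-1} \stackrel{\rm a.s.}{=} \lim_{N\to\infty}\Big\{\sum_{r=1}^{t}(G^{-1})_{\ell,r}\Big[G_{r,t}\lambda_t - \sum_{s=0}^{t-1}\beta_s G_{r,s}\lambda_s\Big] - \lambda_\ell(-\beta_\ell)^{\mathbb{I}_{\ell\ne t}}\Big\}$$

$$\stackrel{\rm a.s.}{=} \lim_{N\to\infty}\Big\{\lambda_t\,\mathbb{I}_{t=\ell} - \sum_{s=0}^{t-1}\beta_s\lambda_s\,\mathbb{I}_{\ell=s} - \lambda_\ell(-\beta_\ell)^{\mathbb{I}_{\ell\ne t}}\Big\} = 0.$$

Notice that the above series of equalities hold because $G$ has, almost surely, a non-singular
limit as $N \to \infty$ as shown in point (g) above.

Equation (3.42) is proved analogously, using $\xi_t = \langle g_t'(b^t, w)\rangle$. □

The proof of Eq. (3.15) follows immediately since the last lemma yields

$$b^t|_{\mathfrak{S}_{t,t}} \stackrel{d}{=} \sum_{i=0}^{t-1}\beta_i b^i + \tilde A q^t_{\perp} - M_t(M_t^*M_t)^{-1}M_t^*\tilde A q^t_{\perp} + M_t\vec o_t(1).$$

Note that, using Lemma 3(c), as $n, N \to \infty$,

$$M_t(M_t^*M_t)^{-1}M_t^*\tilde A q^t_{\perp} \stackrel{d}{=} \tilde M_t\vec o_t(1),$$

which finishes the proof since $\tilde M_t\vec o_t(1) + M_t\vec o_t(1) = \tilde M_t\vec o_t(1)$.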
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(c) For r, s < t we can use the induction hypothesis. For s = t, r < t, we can apply Lemma [Lil to 
b l (proved above), thus obtaining 



t-i 



t-i 



i=0 



i=0 



Note that, by induction hypothesis Bt-i(d) applied to if = gt-i, and using the bound Bt-i(e) 
to control {b l ,b r ), we deduce that each term (m l ,b r ) has a finite limit. Thus, 



t-i 



lim y o (l)(m> r ) a = 0. 

n— >oo t—J 



i=0 



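We can use Lemma 3 for $\langle P^{\perp}_{M_t} \tilde{A} q^t_{\perp}, b^r \rangle = \langle \tilde{A} q^t_{\perp}, P^{\perp}_{M_t} b^r \rangle$ (recalling that $\tilde{A}$ is independent of $q^t_{\perp}$, $P^{\perp}_{M_t} b^r$) to obtain

$$ \langle \tilde{A} q^t_{\perp}, P^{\perp}_{M_t} b^r \rangle \overset{d}{=} \frac{\|q^t_{\perp}\| \, \|P^{\perp}_{M_t} b^r\|}{n} \, \frac{Z}{\sqrt{n}} \overset{a.s.}{\to} 0, $$

where the last estimate uses the induction hypotheses $\mathcal{B}_{t-1}(c)$ and $\mathcal{H}_t(c)$, which imply, almost surely, for some constant $c$, $\langle P^{\perp}_{M_t} b^r, P^{\perp}_{M_t} b^r \rangle \le \langle b^r, b^r \rangle < c$ and $\langle q^t_{\perp}, q^t_{\perp} \rangle \le \langle q^t, q^t \rangle < c$ for all $N$ large enough. Finally, using the induction hypothesis $\mathcal{B}_{t-1}(c)$ for each term of the form $\langle b^i, b^r \rangle$ (noting that $i, r \le t-1$) and Corollary 2 we have

$$ \lim_{n\to\infty} \langle b^t, b^r \rangle \overset{a.s.}{=} \lim_{N\to\infty} \frac{1}{\delta} \sum_{i=0}^{t-1} \beta_i \langle q^i, q^r \rangle \overset{a.s.}{=} \lim_{N\to\infty} \frac{1}{\delta} \langle q^t_{\parallel}, q^r \rangle \overset{a.s.}{=} \lim_{N\to\infty} \frac{1}{\delta} \langle q^t, q^r \rangle. $$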

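The last line uses the definition of $\beta_i$ and $q^t_{\perp} \perp q^r$.
For the case of $r = s = t$, similarly, we have

$$ \langle b^t, b^t \rangle|_{\mathfrak{S}_{t,t}} \overset{d}{=} \sum_{i,j=0}^{t-1} \beta_i \beta_j \langle b^i, b^j \rangle + \langle P^{\perp}_{M_t} \tilde{A} q^t_{\perp}, P^{\perp}_{M_t} \tilde{A} q^t_{\perp} \rangle + o(1). $$

The contribution of other terms is $o(1)$ because:

— $\langle P^{\perp}_{M_t} \tilde{A} q^t_{\perp}, \tilde{M}_t \vec{o}_t(1) \rangle = \langle \tilde{A} q^t_{\perp}, P^{\perp}_{M_t} \tilde{M}_t \vec{o}_t(1) \rangle = 0$.

— $\langle \sum_{i=0}^{t-1} \beta_i b^i, \tilde{M}_t \vec{o}_t(1) \rangle = o(1)$, using Corollary 2 and induction hypothesis $\mathcal{B}_{t-1}(d)$.

— $\langle \sum_{i=0}^{t-1} \beta_i b^i, P^{\perp}_{M_t} \tilde{A} q^t_{\perp} \rangle = o(1)$ follows from Lemma 3 and Corollary 2.

The arguments at the last two points are completely analogous to the one carried out in the case $s = t$, $r < t$ above.

Now, using Lemma 3,

$$ \lim_{n\to\infty} \langle P^{\perp}_{M_t} \tilde{A} q^t_{\perp}, P^{\perp}_{M_t} \tilde{A} q^t_{\perp} \rangle = \lim_{n\to\infty} \big\{ \langle \tilde{A} q^t_{\perp}, \tilde{A} q^t_{\perp} \rangle - \langle P_{M_t} \tilde{A} q^t_{\perp}, P_{M_t} \tilde{A} q^t_{\perp} \rangle \big\} \overset{a.s.}{=} \lim_{N\to\infty} \frac{\langle q^t_{\perp}, q^t_{\perp} \rangle}{\delta}. $$

Hence, from the induction hypothesis $\mathcal{B}_{t-1}(c)$,

$$ \begin{aligned} \lim_{n\to\infty} \langle b^t, b^t \rangle|_{\mathfrak{S}_{t,t}} &\overset{a.s.}{=} \lim_{N\to\infty} \sum_{i,j=0}^{t-1} \beta_i \beta_j \frac{\langle q^i, q^j \rangle}{\delta} + \lim_{N\to\infty} \frac{\langle q^t_{\perp}, q^t_{\perp} \rangle}{\delta} \\ &= \lim_{N\to\infty} \frac{\langle q^t_{\parallel}, q^t_{\parallel} \rangle}{\delta} + \lim_{N\to\infty} \frac{\langle q^t_{\perp}, q^t_{\perp} \rangle}{\delta} = \lim_{N\to\infty} \frac{\langle q^t, q^t \rangle}{\delta}. \end{aligned} $$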



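(e) Conditioning on $\mathfrak{S}_{t,t}$ and using the lemma proved at point (a) above, almost surely,

$$ \frac{1}{n} \sum_{i=1}^{n} (b^t_i)^{2\ell} \le \frac{C}{n} \sum_{i=1}^{n} \Big( \sum_{r=0}^{t-1} \beta_r b^r_i \Big)^{2\ell} + \frac{C}{n} \sum_{i=1}^{n} \Big( \sum_{r=0}^{t-1} o(1)\, m^r_i \Big)^{2\ell} + \frac{C}{n} \sum_{i=1}^{n} \big( [P^{\perp}_{M_t} \tilde{A} q^t_{\perp}]_i \big)^{2\ell}, $$

for some constant $C = C(\ell, t) < \infty$. We will bound each of the above summands.
— The term $n^{-1} \sum_{i=1}^{n} ( \sum_{r=0}^{t-1} \beta_r b^r_i )^{2\ell}$ is finite since we can write

$$ \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{r=0}^{t-1} \beta_r b^r_i \Big)^{2\ell} = O\Big( \sum_{r=0}^{t-1} \beta_r^{2\ell} \, \frac{1}{n} \sum_{i=1}^{n} (b^r_i)^{2\ell} \Big), $$

and use Corollary 2 and induction hypothesis $\mathcal{B}_{t-1}(e)$ for each of the terms $n^{-1} \sum_{i=1}^{n} (b^r_i)^{2\ell}$.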
— For the term $n^{-1} \sum_{i=1}^{n} (m^r_i)^{2\ell}$ we use

$$ (m^r_i)^2 = g_r(b^r_i, w_i)^2 = O\big( (b^r_i)^2 + w_i^2 + g_r(0,0)^2 \big), $$

that follows from the Lipschitz assumption on $g_r$. Thus,

$$ \frac{1}{n} \sum_{i=1}^{n} (m^r_i)^{2\ell} = O\Big( \frac{1}{n} \sum_{i=1}^{n} (b^r_i)^{2\ell} + \frac{1}{n} \sum_{i=1}^{n} w_i^{2\ell} + g_r(0,0)^{2\ell} \Big), \qquad (3.44) $$

which has a finite limit almost surely, using the induction hypothesis $\mathcal{B}_{t-1}(e)$ and the assumption on $w$.

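— The term $n^{-1} \sum_{i=1}^{n} ( [P^{\perp}_{M_t} \tilde{A} q^t_{\perp}]_i )^{2\ell}$ can be written as

$$ \frac{1}{n} \sum_{i=1}^{n} \big( [P^{\perp}_{M_t} \tilde{A} q^t_{\perp}]_i \big)^{2\ell} = O\Big( \frac{1}{n} \sum_{i=1}^{n} \big( [\tilde{A} q^t_{\perp}]_i \big)^{2\ell} \Big) + O\Big( \frac{1}{n} \sum_{i=1}^{n} \big( [P_{M_t} \tilde{A} q^t_{\perp}]_i \big)^{2\ell} \Big). $$

Now, $n^{-1} \sum_{i=1}^{n} ( [\tilde{A} q^t_{\perp}]_i )^{2\ell}$ has a finite limit using the same proof as in $\mathcal{B}_0(b)$ and the fact that $\lim_{n\to\infty} \langle q^t_{\perp}, q^t_{\perp} \rangle \le \lim_{n\to\infty} \langle q^t, q^t \rangle < \infty$ almost surely.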

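Finally, for $n^{-1} \sum_{i=1}^{n} ( [P_{M_t} \tilde{A} q^t_{\perp}]_i )^{2\ell}$, using Lemma 3 and Corollary 2 we can write

$$ P_{M_t} \tilde{A} q^t_{\perp} \overset{d}{=} M_t \Big[ \frac{M_t^* M_t}{n} \Big]^{-1} \begin{bmatrix} Z_0 \|m^0\| \|q^t_{\perp}\| / (n\sqrt{n}) \\ \vdots \\ Z_{t-1} \|m^{t-1}\| \|q^t_{\perp}\| / (n\sqrt{n}) \end{bmatrix} = \frac{1}{\sqrt{n}} \sum_{r=0}^{t-1} c_r m^r Z_r, $$

where $Z_0, \dots, Z_{t-1}$ are i.i.d. with distribution $N(0,1)$, and $c_0, \dots, c_{t-1}$ are almost surely bounded for all $N$ large enough. Therefore, almost surely,

$$ \frac{1}{n} \sum_{i=1}^{n} \big( [P_{M_t} \tilde{A} q^t_{\perp}]_i \big)^{2\ell} = \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{1}{\sqrt{n}} \sum_{r=0}^{t-1} c_r m^r_i Z_r \Big)^{2\ell} \le C \sum_{r=0}^{t-1} \Big( \frac{c_r Z_r}{\sqrt{n}} \Big)^{2\ell} \frac{1}{n} \sum_{i=1}^{n} (m^r_i)^{2\ell}. $$

Now each term is finite using the same argument as in Eq. (3.44).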
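(b) Using part (a) we can write

$$ \phi_b(b^0_i, \dots, b^t_i, w_i)\big|_{\mathfrak{S}_{t,t}} \overset{d}{=} \phi_b\Big( b^0_i, \dots, b^{t-1}_i, \Big[ \sum_{r=0}^{t-1} \beta_r b^r + \tilde{A} q^t_{\perp} + \tilde{M}_t \vec{o}_t(1) \Big]_i, w_i \Big). $$

Similar to the proof of $\mathcal{H}_1(b)$ we can drop the error term $\tilde{M}_t \vec{o}_t(1)$. Indeed, defining

$$ a_i = \Big( b^0_i, \dots, b^{t-1}_i, \Big[ \sum_{r=0}^{t-1} \beta_r b^r + \tilde{A} q^t_{\perp} + \tilde{M}_t \vec{o}_t(1) \Big]_i, w_i \Big), \qquad c_i = \Big( b^0_i, \dots, b^{t-1}_i, \Big[ \sum_{r=0}^{t-1} \beta_r b^r + \tilde{A} q^t_{\perp} \Big]_i, w_i \Big), $$

by the pseudo-Lipschitz assumption

$$ |\phi_b(a_i) - \phi_b(c_i)| \le L \big\{ 1 + \max\big( \|a_i\|^{k-1}, \|c_i\|^{k-1} \big) \big\} \Big| \sum_{r=0}^{t-1} \tilde{m}^r_i \Big| \, o(1). $$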

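Therefore, using the Cauchy-Schwarz inequality twice, we have

$$ \frac{1}{n} \Big| \sum_{i=1}^{n} \phi_b(a_i) - \sum_{i=1}^{n} \phi_b(c_i) \Big| \le L \Big[ \max\Big( \sum_{i=1}^{n} \frac{\|a_i\|^{2k-2}}{n}, \sum_{i=1}^{n} \frac{\|c_i\|^{2k-2}}{n} \Big) \Big]^{1/2} \Big[ \sum_{r=0}^{t-1} \langle \tilde{m}^r, \tilde{m}^r \rangle \Big]^{1/2} o(1). \qquad (3.45) $$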



Also note that

$$ \frac{1}{n} \sum_{i=1}^{n} \|a_i\|^{2\ell} \le (t+2)^{\ell} \Big\{ \sum_{r=0}^{t} \frac{1}{n} \sum_{i=1}^{n} (b^r_i)^{2\ell} + \frac{1}{n} \sum_{i=1}^{n} w_i^{2\ell} \Big\}, $$

which is finite almost surely using the induction hypothesis $\mathcal{B}_t(e)$ proved above and the assumption on $w$. The term $n^{-1} \sum_{i=1}^{n} \|c_i\|^{2\ell}$ is bounded almost surely since

$$ \frac{1}{n} \sum_{i=1}^{n} \|c_i\|^{2\ell} \le C \Big\{ \frac{1}{n} \sum_{i=1}^{n} \|a_i\|^{2\ell} + \sum_{r=0}^{t-1} \frac{1}{n} \sum_{i=1}^{n} (\tilde{m}^r_i)^{2\ell} o(1) \Big\} \le C' \Big\{ \frac{1}{n} \sum_{i=1}^{n} \|a_i\|^{2\ell} + \sum_{r=0}^{t-1} \frac{1}{n} \sum_{i=1}^{n} (m^r_i)^{2\ell} o(1) \Big\}, $$

where the last inequality follows from the fact that $[M_t^* M_t / n]$ has almost surely a non-singular limit as $N \to \infty$, as proved in point (g) above. Finally, for $r \le t-1$, each term $(1/n) \sum_{i=1}^{n} (m^r_i)^{2\ell}$ is bounded using the induction hypothesis $\mathcal{B}_{t-1}(e)$, and the argument in Eq. (3.44).

Hence for any fixed $t$, (3.45) vanishes almost surely when $n$ goes to $\infty$.

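Now given $b^0, \dots, b^{t-1}$, consider the random variables

$$ \tilde{X}_{i,n} = \phi_b\Big( b^0_i, \dots, b^{t-1}_i, \Big[ \sum_{r=0}^{t-1} \beta_r b^r + \tilde{A} q^t_{\perp} \Big]_i, w_i \Big) $$

and $X_{i,n} = \tilde{X}_{i,n} - E_{\tilde{A}}\{\tilde{X}_{i,n}\}$. Proceeding as in Step 1, and using the pseudo-Lipschitz property of $\phi_b$, it is easy to check the conditions of Theorem 3. In particular, an inequality similar to (3.34) can be obtained here as well, and then the condition of Theorem 3 follows using Lemma 2 and induction hypothesis $\mathcal{B}_r(e)$ on the empirical moment bounds for $b^r$ for $r = 0, \dots, t-1$ and for $w$. We therefore get

$$ \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} \Big[ \phi_b\Big( b^0_i, \dots, b^{t-1}_i, \Big[ \sum_{r=0}^{t-1} \beta_r b^r + \tilde{A} q^t_{\perp} \Big]_i, w_i \Big) - E_{\tilde{A}} \Big\{ \phi_b\Big( b^0_i, \dots, b^{t-1}_i, \Big[ \sum_{r=0}^{t-1} \beta_r b^r + \tilde{A} q^t_{\perp} \Big]_i, w_i \Big) \Big\} \Big] \overset{a.s.}{=} 0. \qquad (3.46) $$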



Note that $[\tilde{A} q^t_{\perp}]_i$ is a gaussian random variable with variance $\|q^t_{\perp}\|^2 / n$. Further, $\|q^t_{\perp}\|^2 / n$ converges to a finite limit $\gamma_t^2$ almost surely as $N \to \infty$. Indeed $\|q^t_{\perp}\|^2 / N = \|q^t\|^2 / N - \|q^t_{\parallel}\|^2 / N$. By induction hypothesis $\mathcal{H}_t(b)$ applied to the pseudo-Lipschitz function $\phi_h(h^t_i, x_{0,i}) = f_t(h^t_i, x_{0,i})^2$, $\|q^t\|^2 / N = \langle f_t(h^t, x_0), f_t(h^t, x_0) \rangle$ converges to a finite limit. Further, $\|q^t_{\parallel}\|^2 / N = \sum_{r,s=0}^{t-1} \beta_r \beta_s \langle q^r, q^s \rangle$ also converges since the products $\langle q^r, q^s \rangle$ do and the coefficients $\beta_r$, $r \le t-1$, converge by Corollary 2.

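Hence we can use induction hypothesis $\mathcal{B}_{t-1}(b)$ and Corollary 2 for

$$ \hat{\phi}_b(b^0_i, \dots, b^{t-1}_i, w_i) = E_Z \Big\{ \phi_b\Big( b^0_i, \dots, b^{t-1}_i, \sum_{r=0}^{t-1} \beta_r b^r_i + \frac{\|q^t_{\perp}\|}{\sqrt{n}} Z, w_i \Big) \Big\}, $$

where $Z$ is an independent $N(0,1)$ random variable, to show

$$ \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} E_{\tilde{A}} \Big\{ \phi_b\Big( b^0_i, \dots, b^{t-1}_i, \Big[ \sum_{r=0}^{t-1} \beta_r b^r + \tilde{A} q^t_{\perp} \Big]_i, w_i \Big) \Big\} \overset{a.s.}{=} E\, E_Z \Big\{ \phi_b\Big( \sigma_0 Z_0, \dots, \sigma_{t-1} Z_{t-1}, \sum_{r=0}^{t-1} \beta_r \sigma_r Z_r + \gamma_t Z, W \Big) \Big\}. \qquad (3.47) $$

Note that $\sum_{r=0}^{t-1} \beta_r \sigma_r Z_r + \gamma_t Z$ is gaussian. All that we need is to show that the variance of this gaussian is $\sigma_t^2$. But using a combination of (3.46) and (3.47) for the pseudo-Lipschitz function $\phi_b(y_0, \dots, y_t, w_i) = y_t^2$,

$$ \lim_{n\to\infty} \langle b^t, b^t \rangle \overset{a.s.}{=} E \Big\{ \Big( \sum_{r=0}^{t-1} \beta_r \sigma_r Z_r + \gamma_t Z \Big)^2 \Big\}. \qquad (3.48) $$

On the other hand, in part (c) we proved $\lim_{n\to\infty} \langle b^t, b^t \rangle \overset{a.s.}{=} \lim_{N\to\infty} \delta^{-1} \langle f(h^t, x_0), f(h^t, x_0) \rangle$. By induction hypothesis $\mathcal{H}_t(b)$ for the pseudo-Lipschitz function $\phi_h(y_0, \dots, y_t, x_{0,i}) = f(y_t, x_{0,i})^2$ we get $\lim_{N\to\infty} \delta^{-1} \langle f(h^t, x_0), f(h^t, x_0) \rangle \overset{a.s.}{=} \delta^{-1} E\{ f(\tau_{t-1} Z, X_0)^2 \}$. So by definition (3.3), both sides of (3.48) are equal to $\sigma_t^2$.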

(d) In a manner very similar to the proof of $\mathcal{B}_0(d)$, using part (b) for the pseudo-Lipschitz function $\phi_b : \mathbb{R}^{t+2} \to \mathbb{R}$ that is given by $\phi_b(y_0, \dots, y_t, w_i) = y_t \varphi(y_s, w_i)$ we can obtain

$$ \lim_{n\to\infty} \langle b^t, \varphi(b^s, w) \rangle \overset{a.s.}{=} E\{ \sigma_t Z_t \, \varphi(\sigma_s Z_s, W) \}, $$

for jointly gaussian $Z_t$, $Z_s$ with distribution $N(0,1)$. Using Stein's lemma (Lemma 4), this is almost surely equal to $\mathrm{Cov}(\sigma_t Z_t, \sigma_s Z_s) E\{ \varphi'(\sigma_s Z_s, W) \}$. Another application of part (b), for $\phi_b(y_0, \dots, y_t, w_i) = y_s y_t$, transforms $\mathrm{Cov}(\sigma_t Z_t, \sigma_s Z_s)$ to $\lim_{n\to\infty} \langle b^t, b^s \rangle$. Similar to $\mathcal{B}_0(d)$, we can use Lemma 6 to transform $E\{ \varphi'(\sigma_s Z_s, W) \}$ to $\lim_{n\to\infty} \langle \varphi'(b^s, w) \rangle$ almost surely. This finishes the proof of (d).
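The Gaussian integration-by-parts step used in part (d) can be checked numerically. The following is a minimal sketch, not part of the proof: it drops the second argument $W$ of $\varphi$ for simplicity, uses `tanh` as an illustrative Lipschitz function, and the helper names `gaussian_expectation` and `stein_sides` are hypothetical.

```python
import math


def gaussian_expectation(func, half_width=10.0, n_points=20001):
    """Approximate E[func(Z)], Z ~ N(0, 1), with the trapezoidal rule."""
    step = 2.0 * half_width / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        z = -half_width + k * step
        weight = 0.5 if k in (0, n_points - 1) else 1.0
        total += weight * func(z) * math.exp(-0.5 * z * z)
    return total * step / math.sqrt(2.0 * math.pi)


def stein_sides(phi, dphi, sigma_t, sigma_s, rho):
    """Return both sides of E[s_t Z_t phi(s_s Z_s)] = Cov * E[phi'(s_s Z_s)].

    Z_t and Z_s are standard gaussians with correlation rho (an assumed
    parameter). Since E[Z_t | Z_s] = rho * Z_s, the left side reduces to a
    one-dimensional integral over Z_s.
    """
    lhs = gaussian_expectation(lambda z: sigma_t * rho * z * phi(sigma_s * z))
    cov = sigma_t * sigma_s * rho
    rhs = cov * gaussian_expectation(lambda z: dphi(sigma_s * z))
    return lhs, rhs


if __name__ == "__main__":
    left, right = stein_sides(math.tanh, lambda x: 1.0 - math.tanh(x) ** 2,
                              sigma_t=1.3, sigma_s=0.7, rho=0.4)
    print(left, right)
```

The two returned values should agree up to quadrature error, which is the identity invoked when passing from $E\{\sigma_t Z_t \varphi(\sigma_s Z_s, W)\}$ to the covariance times $E\{\varphi'\}$.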



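3.9.4 Step 4: $\mathcal{H}_{t+1}$

Due to symmetry, the proof of this step is very similar to the proof of Step 3, and we present only some differences.

(g) This part is very similar to the one of $\mathcal{B}_t(g)$.

(a) To prove Eq. (3.14) we use part (a) of the same lemma as for $\mathcal{B}_t(a)$ to obtain

$$ h^{t+1}|_{\mathfrak{S}_{t+1,t}} \overset{d}{=} \sum_{i=0}^{t-1} \alpha_i h^{i+1} + \tilde{A}^* m^t_{\perp} - Q_{t+1} (Q_{t+1}^* Q_{t+1})^{-1} Q_{t+1}^* \tilde{A}^* m^t_{\perp} + Q_t \vec{o}_t(1). $$

Now, using Lemma 3(c), as $n, N \to \infty$,

$$ Q_{t+1} (Q_{t+1}^* Q_{t+1})^{-1} Q_{t+1}^* \tilde{A}^* m^t_{\perp} \overset{d}{=} \tilde{Q}_{t+1} \vec{o}_t(1), $$

which finishes the proof since $Q_t \vec{o}_t(1) + \tilde{Q}_{t+1} \vec{o}_t(1) = \tilde{Q}_{t+1} \vec{o}_t(1)$.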

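(c) For $r, s < t$ we can use the induction hypothesis. For $s = t$, $r < t$, very similarly to the proof of $\mathcal{B}_t(c)$,

$$ \langle h^{t+1}, h^{r+1} \rangle|_{\mathfrak{S}_{t+1,t}} \overset{d}{=} \sum_{i=0}^{t-1} \alpha_i \langle h^{i+1}, h^{r+1} \rangle + \langle P^{\perp}_{Q_{t+1}} \tilde{A}^* m^t_{\perp}, h^{r+1} \rangle + \sum_{i=0}^{t-1} o(1) \langle q^i, h^{r+1} \rangle. $$

Now, by induction hypothesis $\mathcal{H}_t(d)$, for $\varphi = f$, each term $\langle q^i, h^{r+1} \rangle$ has a finite limit. Thus,

$$ \lim_{N\to\infty} \sum_{i=0}^{t-1} o(1) \langle q^i, h^{r+1} \rangle \overset{a.s.}{=} 0. $$

We can use induction hypothesis $\mathcal{H}_{r+1}(c)$ or $\mathcal{H}_{i+1}(c)$ for each term of the form $\langle h^{i+1}, h^{r+1} \rangle$ and use Lemma 3 for $\langle \tilde{A}^* m^t_{\perp}, P^{\perp}_{Q_{t+1}} h^{r+1} \rangle$ to obtain

$$ \lim_{N\to\infty} \langle h^{t+1}, h^{r+1} \rangle \overset{a.s.}{=} \lim_{N\to\infty} \sum_{i=0}^{t-1} \alpha_i \langle m^i, m^r \rangle \overset{a.s.}{=} \lim_{N\to\infty} \langle m^t_{\parallel}, m^r \rangle \overset{a.s.}{=} \lim_{N\to\infty} \langle m^t, m^r \rangle, $$

where the last line uses the definition of $\alpha_i$ and $m^t_{\perp} \perp m^r$.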
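For the case of $r = s = t$, we have

$$ \langle h^{t+1}, h^{t+1} \rangle|_{\mathfrak{S}_{t+1,t}} \overset{d}{=} \sum_{i,j=0}^{t-1} \alpha_i \alpha_j \langle h^{i+1}, h^{j+1} \rangle + \langle P^{\perp}_{Q_{t+1}} \tilde{A}^* m^t_{\perp}, P^{\perp}_{Q_{t+1}} \tilde{A}^* m^t_{\perp} \rangle + o(1). $$

Note that we used an argument similar to the one in the proof of $\mathcal{B}_t(c)$ to show that the contributions of all products of the form $\langle Q_t \vec{o}_t(1), \cdot \rangle$ and $\langle P^{\perp}_{Q_{t+1}} \tilde{A}^* m^t_{\perp}, h^{i+1} \rangle$ tend to 0 almost surely. Now, using the induction hypothesis and Lemma 3,

$$ \begin{aligned} \lim_{N\to\infty} \langle h^{t+1}, h^{t+1} \rangle|_{\mathfrak{S}_{t+1,t}} &\overset{a.s.}{=} \lim_{N\to\infty} \sum_{i,j=0}^{t-1} \alpha_i \alpha_j \langle m^i, m^j \rangle + \lim_{n\to\infty} \frac{1}{n} \|m^t_{\perp}\|^2 \\ &\overset{a.s.}{=} \lim_{n\to\infty} \langle m^t_{\parallel}, m^t_{\parallel} \rangle + \lim_{n\to\infty} \langle m^t_{\perp}, m^t_{\perp} \rangle = \lim_{n\to\infty} \langle m^t, m^t \rangle. \end{aligned} $$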

n— >oo 



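(e) This part is very similar to $\mathcal{B}_t(e)$.

(f) Using $\mathcal{H}_{t+1}(a)$ and Lemma 3(a) we have almost surely

$$ \lim_{N\to\infty} \langle h^{t+1}, q^0 \rangle = \lim_{N\to\infty} \frac{Z \|m^t_{\perp}\| \|q^0\|}{N \sqrt{n}} + \sum_{i=0}^{t-1} \lim_{N\to\infty} \alpha_i \langle h^{i+1}, q^0 \rangle. $$

But this limit is almost surely 0, using the induction hypothesis $\mathcal{H}_r(e)$ for $r \le t$ and $\mathcal{B}_t(c)$.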
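(b) Using part (a) we can write

$$ \phi_h(h^1_i, \dots, h^{t+1}_i, x_{0,i})\big|_{\mathfrak{S}_{t+1,t}} \overset{d}{=} \phi_h\Big( h^1_i, \dots, h^t_i, \Big[ \sum_{r=0}^{t-1} \alpha_r h^{r+1} + \tilde{A}^* m^t_{\perp} + \tilde{Q}_{t+1} \vec{o}_{t+1}(1) \Big]_i, x_{0,i} \Big). $$

Similar to the proof of $\mathcal{B}_t(b)$ we can drop the error term $\tilde{Q}_{t+1} \vec{o}_{t+1}(1)$. Now given $h^1, \dots, h^t$, consider the random variables

$$ \tilde{X}_{i,N} = \phi_h\Big( h^1_i, \dots, h^t_i, \Big[ \sum_{r=0}^{t-1} \alpha_r h^{r+1} + \tilde{A}^* m^t_{\perp} \Big]_i, x_{0,i} \Big) $$

and $X_{i,N} = \tilde{X}_{i,N} - E_{\tilde{A}}\{\tilde{X}_{i,N}\}$. Proceeding as in Step 2, and using the pseudo-Lipschitz property of $\phi_h$, it is easy to check the conditions of Theorem 3. We therefore get

$$ \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} \Big[ \phi_h\Big( h^1_i, \dots, h^t_i, \Big[ \sum_{r=0}^{t-1} \alpha_r h^{r+1} + \tilde{A}^* m^t_{\perp} \Big]_i, x_{0,i} \Big) - E_{\tilde{A}} \Big\{ \phi_h\Big( h^1_i, \dots, h^t_i, \Big[ \sum_{r=0}^{t-1} \alpha_r h^{r+1} + \tilde{A}^* m^t_{\perp} \Big]_i, x_{0,i} \Big) \Big\} \Big] \overset{a.s.}{=} 0. \qquad (3.49) $$

Note that $[\tilde{A}^* m^t_{\perp}]_i$ is a gaussian random variable with variance $\|m^t_{\perp}\|^2 / n$. Hence we can use induction hypothesis $\mathcal{H}_t(b)$ for

$$ \hat{\phi}_h(h^1_i, \dots, h^t_i, x_{0,i}) = E_Z \Big\{ \phi_h\Big( h^1_i, \dots, h^t_i, \sum_{r=0}^{t-1} \alpha_r h^{r+1}_i + \frac{\|m^t_{\perp}\|}{\sqrt{n}} Z, x_{0,i} \Big) \Big\}, $$

where $Z$ is an independent $N(0,1)$ random variable, to show

$$ \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} E_{\tilde{A}} \Big\{ \phi_h\Big( h^1_i, \dots, h^t_i, \Big[ \sum_{r=0}^{t-1} \alpha_r h^{r+1} + \tilde{A}^* m^t_{\perp} \Big]_i, x_{0,i} \Big) \Big\} \overset{a.s.}{=} E\, E_Z \Big\{ \phi_h\Big( \tau_0 Z_0, \dots, \tau_{t-1} Z_{t-1}, \sum_{r=0}^{t-1} \alpha_r \tau_r Z_r + \frac{\|m^t_{\perp}\|}{\sqrt{n}} Z, X_0 \Big) \Big\}. \qquad (3.50) $$

Note that $\sum_{r=0}^{t-1} \alpha_r \tau_r Z_r + n^{-1/2} \|m^t_{\perp}\| Z$ is gaussian. All that we need is to show that the variance of this gaussian is $\tau_t^2$. But using a combination of (3.49) and (3.50) for the pseudo-Lipschitz function $\phi_h(y_0, \dots, y_t, x_{0,i}) = y_t^2$,

$$ \lim_{N\to\infty} \langle h^{t+1}, h^{t+1} \rangle \overset{a.s.}{=} E \Big\{ \Big( \sum_{r=0}^{t-1} \alpha_r \tau_r Z_r + \frac{\|m^t_{\perp}\|}{\sqrt{n}} Z \Big)^2 \Big\}. \qquad (3.51) $$

On the other hand, in part (c) we proved $\lim_{N\to\infty} \langle h^{t+1}, h^{t+1} \rangle \overset{a.s.}{=} \lim_{n\to\infty} \langle g_t(b^t, w), g_t(b^t, w) \rangle$. By the induction hypothesis $\mathcal{B}_t(b)$ for the pseudo-Lipschitz function $\phi_b(y_0, \dots, y_t, w_i) = g_t(y_t, w_i)^2$ we get $\lim_{n\to\infty} \langle g_t(b^t, w), g_t(b^t, w) \rangle \overset{a.s.}{=} E\{ g_t(\sigma_t Z, W)^2 \}$. So by the definition (1.4), both sides of (3.51) are equal to $\tau_t^2$.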

(d) This is very similar to the proof of $\mathcal{B}_t(d)$. For the pseudo-Lipschitz function $\phi_h : \mathbb{R}^{t+2} \to \mathbb{R}$ that is given by $\phi_h(y_1, \dots, y_{t+1}, x_{0,i}) = y_{t+1} \varphi(y_{s+1}, x_{0,i})$ we can use part (b) to obtain

$$ \lim_{N\to\infty} \langle h^{t+1}, \varphi(h^{s+1}, x_0) \rangle \overset{a.s.}{=} E\{ \tau_t Z_t \, \varphi(\tau_s Z_s, X_0) \}, $$

for jointly gaussian $Z_t$, $Z_s$ with distribution $N(0,1)$. Using Stein's lemma (Lemma 4), this is almost surely equal to $\mathrm{Cov}(\tau_t Z_t, \tau_s Z_s) E\{ \varphi'(\tau_s Z_s, X_0) \}$. Another application of part (b), for $\phi_h(y_1, \dots, y_{t+1}, x_{0,i}) = y_{s+1} y_{t+1}$, transforms $\mathrm{Cov}(\tau_t Z_t, \tau_s Z_s)$ to $\lim_{N\to\infty} \langle h^{t+1}, h^{s+1} \rangle$. Similar to $\mathcal{H}_1(d)$, using Lemma 6, $E\{ \varphi'(\tau_s Z_s, X_0) \}$ can be transformed to $\lim_{N\to\infty} \langle \varphi'(h^{s+1}, x_0) \rangle$ almost surely. This finishes the proof of (d).

3.10 Proof of Corollary 1

First notice that the statement to be proved is equivalent to the following claim. The joint distribution of $(x_{J(1)}, \dots, x_{J(\ell)}, x_{0,J(1)}, \dots, x_{0,J(\ell)})$, for $J(1), \dots, J(\ell) \in [N]$ a uniformly random subset of distinct indices, converges weakly to the distribution of $(X_1, \dots, X_\ell, X_{0,1}, \dots, X_{0,\ell})$. By the general theory of weak convergence, it is therefore sufficient to check Eq. (1.7) for functions of the form

$$ \psi(x_1, \dots, x_\ell, y_1, \dots, y_\ell) = \psi_1(x_1, y_1) \cdots \psi_\ell(x_\ell, y_\ell), \qquad (3.52) $$

for $\psi_i : \mathbb{R}^2 \to \mathbb{R}$ Lipschitz and bounded. This case follows immediately from Theorem 1 once we notice that

$$ E\, \psi(x_{J(1)}, \dots, x_{J(\ell)}, x_{0,J(1)}, \dots, x_{0,J(\ell)}) = \prod_{i=1}^{\ell} \Big( \frac{1}{N} \sum_{j=1}^{N} \psi_i(x_j, x_{0,j}) \Big) + O(1/N). \qquad (3.53) $$
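The $O(1/N)$ correction in (3.53) comes only from the requirement that the sampled indices be distinct. A minimal sketch for the case $\ell = 2$ follows; the lists `a` and `b` stand for the values $\psi_1(x_j, x_{0,j})$ and $\psi_2(x_j, x_{0,j})$, and the function names are hypothetical. For values bounded by 1 in absolute value, the gap between the two quantities is at most $2/(N-1)$.

```python
import random


def mean_over_distinct_pairs(a, b):
    """Exact E[a[J1] * b[J2]] for a uniformly random ordered pair J1 != J2."""
    n = len(a)
    # Sum over all ordered pairs minus the diagonal terms j1 == j2.
    total = sum(a) * sum(b) - sum(x * y for x, y in zip(a, b))
    return total / (n * (n - 1))


def product_of_means(a, b):
    """Product of the two empirical averages, as on the right of (3.53)."""
    n = len(a)
    return (sum(a) / n) * (sum(b) / n)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 100, 1000):
        a = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        gap = abs(mean_over_distinct_pairs(a, b) - product_of_means(a, b))
        print(n, gap, 2.0 / (n - 1))
```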



4 Symmetric Case

Let $k \ge 2$, $G = A^* + A$ with $A \in \mathbb{R}^{N \times N}$, and assume that the entries of $A$ are i.i.d. $N(0, (2N)^{-1})$. Also let $f : \mathbb{R} \to \mathbb{R}$ be a Lipschitz continuous function. Start with $m^0$ and $m^1$ in $\mathbb{R}^N$ where $m^0 = 0_{N \times 1}$ and $m^1$ is a fixed deterministic vector in $\mathbb{R}^N$ with $\limsup_{N\to\infty} N^{-1} \sum_{i=1}^{N} (m^1_i)^{2k-2} < \infty$, and proceed by the following iteration

$$ h^{t+1} = G m^t - \lambda_t m^{t-1}, \qquad (4.1) $$
$$ m^t = f(h^t), $$

where $\lambda_t = \langle f'(h^t) \rangle$. Now let $\tau_1^2 = \lim_{N\to\infty} \langle m^1, m^1 \rangle$, and define recursively for $t \ge 1$,

$$ \tau_{t+1}^2 = E\{ [f(\tau_t Z)]^2 \}, \qquad (4.2) $$

with $Z \sim N(0,1)$.
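The one-dimensional recursion (4.2) is easy to iterate numerically. The sketch below is illustrative only: it approximates the gaussian expectation by a trapezoidal rule, and the function names `gaussian_mean` and `state_evolution` as well as the choice $f = \tanh$ in the demo are assumptions, not part of the text.

```python
import math


def gaussian_mean(func, half_width=10.0, n_points=20001):
    """Approximate E[func(Z)], Z ~ N(0, 1), with the trapezoidal rule."""
    step = 2.0 * half_width / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        z = -half_width + k * step
        weight = 0.5 if k in (0, n_points - 1) else 1.0
        total += weight * func(z) * math.exp(-0.5 * z * z)
    return total * step / math.sqrt(2.0 * math.pi)


def state_evolution(f, tau1_sq, n_steps):
    """Iterate tau_{t+1}^2 = E[f(tau_t Z)^2]; returns [tau_1^2, ..., tau_{n_steps+1}^2]."""
    taus_sq = [tau1_sq]
    for _ in range(n_steps):
        tau = math.sqrt(taus_sq[-1])
        taus_sq.append(gaussian_mean(lambda z: f(tau * z) ** 2))
    return taus_sq


if __name__ == "__main__":
    # Illustrative choice f = tanh, started from tau_1^2 = 1.
    for t, value in enumerate(state_evolution(math.tanh, 1.0, 5), start=1):
        print(t, value)
```

For the identity map $f(x) = x$ the recursion leaves $\tau_t^2$ unchanged, which gives a quick sanity check of the quadrature.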

Theorem 4. Let $\{A(N)\}_N$ be a sequence of matrices $A \in \mathbb{R}^{N \times N}$ indexed by $N$, with i.i.d. entries $A_{ij} \sim N(0, (2N)^{-1})$. Then, for any pseudo-Lipschitz function $\psi : \mathbb{R} \to \mathbb{R}$ of order $k$ and all $t \in \mathbb{N}$, almost surely

$$ \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} \psi(m^{t+1}_i) = E[ \psi(f(\tau_t Z)) ]. \qquad (4.3) $$

Note 4. This theorem was proved by Bolthausen in the case $f(x) = \tanh(\beta x + h)$ and $\langle m^1, m^1 \rangle = \tau_*^2$, for $\tau_*$ the fixed point of the recursion (4.2). The general proof is very similar to the one of Theorem
0, and exploits the same conditioning trick. We omit it to avoid repetitions. 

When we are calculating $h^{t+1}$, all values $h^1, \dots, h^t$ and hence $m^1, \dots, m^t$ are known to us. Denote the $\sigma$-algebra generated by all of these random variables by $\mathfrak{S}_t$. Moreover, use the following compact formulation for (4.1):

$$ \underbrace{[\, h^2 \,|\, h^3 + \lambda_2 m^1 \,|\, \cdots \,|\, h^t + \lambda_{t-1} m^{t-2} \,]}_{Y_{t-1}} = G \underbrace{[\, m^1 \,|\, \dots \,|\, m^{t-1} \,]}_{M_{t-1}}. $$

The analogue of Lemma 1 is the following.

Lemma 15. Let $\{A(N)\}_N$ be a sequence of random matrices as in Theorem 4. Then the following hold for all $t \in \mathbb{N}$:

(a)

$$ h^{t+1}|_{\mathfrak{S}_t} \overset{d}{=} \sum_{i=1}^{t-1} \alpha_i h^{i+1} + \tilde{G} m^t_{\perp} + \tilde{M}_{t-1} \vec{o}_t(1), \qquad (4.4) $$

where $\tilde{G}$ is an independent copy of $G$ and the coefficients $\alpha_i$ satisfy $m^t_{\parallel} = \sum_{i=1}^{t-1} \alpha_i m^i$. The matrix $\tilde{M}_t$ is such that its columns form an orthogonal basis for the column space of $M_t$ and $\tilde{M}_t^* \tilde{M}_t = n\, I_{t \times t}$. Recall that $\vec{o}_t(1) \in \mathbb{R}^t$ is a finite dimensional random vector that converges to 0 almost surely as $N \to \infty$.

(b) For any pseudo-Lipschitz function $\phi : \mathbb{R}^t \to \mathbb{R}$ of order $k$,

$$ \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} \phi(h^2_i, \dots, h^{t+1}_i) \overset{a.s.}{=} E[ \phi(\tau_1 Z_1, \dots, \tau_t Z_t) ], \qquad (4.5) $$

where $Z_1, \dots, Z_t$ have $N(0,1)$ distribution.

(c) For all $1 \le r, s \le t$ the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

$$ \lim_{N\to\infty} \langle h^{r+1}, h^{s+1} \rangle \overset{a.s.}{=} \lim_{N\to\infty} \langle m^r, m^s \rangle. \qquad (4.6) $$

(d) For all $1 \le r, s \le t$, and for any Lipschitz continuous function $\varphi$, the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

$$ \lim_{N\to\infty} \langle h^{r+1}, \varphi(h^{s+1}) \rangle \overset{a.s.}{=} \lim_{N\to\infty} \langle h^{r+1}, h^{s+1} \rangle \langle \varphi'(h^{s+1}) \rangle. \qquad (4.7) $$

(e) For $\ell = k - 1$, almost surely $\limsup_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} (h^{t+1}_i)^{2\ell} < \infty$.

(f) For all $0 \le r \le t$ the following limit exists and there are positive constants $\rho_r$ (independent of $N$) such that almost surely

$$ \lim_{N\to\infty} \langle m^r_{\perp}, m^r_{\perp} \rangle > \rho_r. \qquad (4.8) $$

A AMP algorithm: An heuristic derivation

In this appendix we present an heuristic derivation of the AMP iteration (1.1) starting from the standard message passing formulation (1.2). Let us stress that such derivation is not relevant for the proof of our Theorem 1. Our objective is to help the reader develop an intuitive understanding of the AMP iteration. For further discussion of the connection with belief propagation we refer to [DMM10a, DMM10b].

Let us rewrite the message passing iteration for greater convenience of the reader:

$$ z^t_{a \to i} = y_a - \sum_{j \in [N] \setminus i} A_{aj} x^t_{j \to a}, \qquad (A.1) $$
$$ x^{t+1}_{i \to a} = \eta_t \Big( \sum_{b \in [n] \setminus a} A_{bi} z^t_{b \to i} \Big). \qquad (A.2) $$

Notice that on the right-hand side of both equations the messages appear in sums of $\Theta(N)$ terms. Consider for instance the messages $\{ z^t_{a \to i} \}_{i \in [N]}$ for a fixed node $a \in [n]$. These depend on $i \in [N]$ only because the excluded term changes. It is therefore natural to guess that $z^t_{a \to i} = z^t_a + O(N^{-1/2})$ and $x^t_{i \to a} = x^t_i + O(n^{-1/2})$, where $z^t_a$ only depends on the index $a$ (and not on $i$), and $x^t_i$ only depends on $i$ (and not on $a$).

A naive approximation would consist in neglecting the $O(N^{-1/2})$ correction, but this turns out to produce a non-vanishing error in the large-$N$ limit. We instead set

$$ z^t_{a \to i} = z^t_a + \delta z^t_{a \to i}, \qquad x^t_{i \to a} = x^t_i + \delta x^t_{i \to a}. $$

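Substituting in Eq. (A.1), we get

$$ z^t_a + \delta z^t_{a \to i} = y_a - \sum_{j \in [N]} A_{aj} (x^t_j + \delta x^t_{j \to a}) + A_{ai} (x^t_i + \delta x^t_{i \to a}), $$
$$ x^{t+1}_i + \delta x^{t+1}_{i \to a} = \eta_t \Big( \sum_{b \in [n]} A_{bi} (z^t_b + \delta z^t_{b \to i}) - A_{ai} (z^t_a + \delta z^t_{a \to i}) \Big). $$

We will now drop the terms that are negligible without writing explicitly the error terms. First of all notice that single terms of the type $A_{ai} \delta z^t_{a \to i}$ are of order $1/N$ and can be safely neglected. Indeed $\delta z^t_{a \to i} = O(N^{-1/2})$ by our ansatz, and $A_{ai} = O(N^{-1/2})$ by definition. We get

$$ z^t_a + \delta z^t_{a \to i} = y_a - \sum_{j \in [N]} A_{aj} (x^t_j + \delta x^t_{j \to a}) + A_{ai} x^t_i, $$
$$ x^{t+1}_i + \delta x^{t+1}_{i \to a} = \eta_t \Big( \sum_{b \in [n]} A_{bi} (z^t_b + \delta z^t_{b \to i}) - A_{ai} z^t_a \Big). $$

We next expand the second equation to linear order in $\delta x^t_{i \to a}$ and $\delta z^t_{a \to i}$:

$$ z^t_a + \delta z^t_{a \to i} = y_a - \sum_{j \in [N]} A_{aj} (x^t_j + \delta x^t_{j \to a}) + A_{ai} x^t_i, $$
$$ x^{t+1}_i + \delta x^{t+1}_{i \to a} = \eta_t \Big( \sum_{b \in [n]} A_{bi} (z^t_b + \delta z^t_{b \to i}) \Big) - \eta_t' \Big( \sum_{b \in [n]} A_{bi} (z^t_b + \delta z^t_{b \to i}) \Big) A_{ai} z^t_a. $$

Notice that the last term on the right-hand side of the first equation is the only one dependent on $i$, and we can therefore identify this term with $\delta z^t_{a \to i}$. We obtain the decomposition

$$ z^t_a = y_a - \sum_{j \in [N]} A_{aj} (x^t_j + \delta x^t_{j \to a}), \qquad (A.3) $$
$$ \delta z^t_{a \to i} = A_{ai} x^t_i. \qquad (A.4) $$

Analogously for the second equation we get

$$ x^{t+1}_i = \eta_t \Big( \sum_{b \in [n]} A_{bi} (z^t_b + \delta z^t_{b \to i}) \Big), \qquad (A.5) $$
$$ \delta x^{t+1}_{i \to a} = - \eta_t' \Big( \sum_{b \in [n]} A_{bi} (z^t_b + \delta z^t_{b \to i}) \Big) A_{ai} z^t_a. \qquad (A.6) $$

Substituting Eq. (A.4) in Eq. (A.5) to eliminate $\delta z^t_{b \to i}$ we get

$$ x^{t+1}_i = \eta_t \Big( \sum_{b \in [n]} A_{bi} z^t_b + \sum_{b \in [n]} A_{bi}^2 x^t_i \Big), \qquad (A.7) $$

and using the normalization of $A$, we get $\sum_{b \in [n]} A_{bi}^2 \to 1$, whence

$$ x^{t+1} = \eta_t (x^t + A^* z^t). \qquad (A.8) $$

Analogously, substituting Eq. (A.6) in (A.3), we get

$$ z^{t+1}_a = y_a - \sum_{j \in [N]} A_{aj} x^{t+1}_j + \sum_{j \in [N]} A_{aj}^2 \, \eta_t'(x^t_j + (A^* z^t)_j) \, z^t_a. \qquad (A.9) $$

Again, using the law of large numbers and the normalization of $A$, we get

$$ \sum_{j \in [N]} A_{aj}^2 \, \eta_t'(x^t_j + (A^* z^t)_j) \approx \frac{1}{n} \sum_{j \in [N]} \eta_t'(x^t_j + (A^* z^t)_j) = \frac{1}{\delta} \langle \eta_t'(x^t + A^* z^t) \rangle, \qquad (A.10) $$

whence, substituting in (A.9), we obtain the second equation in (1.1). This finishes our derivation.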
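The resulting iteration, (A.8) together with the update obtained from (A.9) and (A.10), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: $\eta_t$ is taken to be soft thresholding, the threshold rule $\theta_t = \alpha \|z^t\| / \sqrt{n}$ with `alpha = 1.5` is a hypothetical tuning choice, and the names `amp`, `soft_threshold` and `demo` are invented for this example.

```python
import math
import random


def soft_threshold(u, theta):
    """Soft thresholding, used here as an assumed choice of eta_t."""
    if u > theta:
        return u - theta
    if u < -theta:
        return u + theta
    return 0.0


def amp(A, y, n_iter=20, alpha=1.5):
    """Run x^{t+1} = eta_t(x^t + A^* z^t), z^{t+1} = y - A x^{t+1} + Onsager term."""
    n, N = len(A), len(A[0])
    delta = n / N
    x = [0.0] * N
    z = list(y)  # z^0 = y because x^0 = 0
    for _ in range(n_iter):
        theta = alpha * math.sqrt(sum(v * v for v in z) / n)
        # Effective observation x^t + A^* z^t.
        u = [x[j] + sum(A[a][j] * z[a] for a in range(n)) for j in range(N)]
        x = [soft_threshold(v, theta) for v in u]
        # (1/delta) <eta_t'(u)>: the derivative of soft thresholding is 1{|u| > theta}.
        onsager = sum(1 for v in u if abs(v) > theta) / N / delta
        z = [y[a] - sum(A[a][j] * x[j] for j in range(N)) + onsager * z[a]
             for a in range(n)]
    return x


def demo(seed=0, n=100, N=200, k=10):
    """Noiseless recovery of a k-sparse +-1 signal; returns (error, signal energy)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(N)] for _ in range(n)]
    x0 = [0.0] * N
    for j in rng.sample(range(N), k):
        x0[j] = rng.choice([-1.0, 1.0])
    y = [sum(A[a][j] * x0[j] for j in range(N)) for a in range(n)]
    x_hat = amp(A, y)
    error = sum((p - q) ** 2 for p, q in zip(x_hat, x0))
    return error, sum(q * q for q in x0)


if __name__ == "__main__":
    print(demo())
```

Dropping the `onsager * z[a]` term gives the naive iteration mentioned above, whose error does not vanish in the large-$N$ limit.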

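B Strong law of large numbers for triangular arrays

In this section we show how Theorem 3 can be obtained from Theorem 2.1 of Hu and Taylor [HT97]. Define $a_n = n$, $p = 2$, and $\psi(t) = t^{2+\rho}$. It is clear that $\psi$ satisfies condition (2.1) from [HT97]. Next, the condition $n^{-1} \sum_{i=1}^{n} E|X_{n,i}|^{2+\rho} \le c\, n^{\rho/2}$ yields

$$ \sum_{n=1}^{\infty} \sum_{i=1}^{n} E \frac{\psi(|X_{n,i}|)}{\psi(a_n)} = \sum_{n=1}^{\infty} \sum_{i=1}^{n} \frac{E|X_{n,i}|^{2+\rho}}{n^{2+\rho}} \le c \sum_{n=1}^{\infty} \frac{1}{n^{1+\rho/2}} < \infty. $$

Therefore condition (2.3) from [HT97] also holds. Finally, for any positive integer $k$, using the condition $n^{-1} \sum_{i=1}^{n} E|X_{n,i}|^{2+\rho} \le c\, n^{\rho/2}$ and a generalized mean inequality,

$$ \sum_{n=1}^{\infty} \Big( \sum_{i=1}^{n} E \frac{X_{n,i}^2}{a_n^2} \Big)^{2k} \le \sum_{n=1}^{\infty} \Big( \frac{1}{n^2} \sum_{i=1}^{n} \big[ E|X_{n,i}|^{2+\rho} \big]^{\frac{2}{2+\rho}} \Big)^{2k} \le \sum_{n=1}^{\infty} \Big( \frac{n^{\frac{\rho}{2+\rho}}}{n^2} \Big[ \sum_{i=1}^{n} E|X_{n,i}|^{2+\rho} \Big]^{\frac{2}{2+\rho}} \Big)^{2k} \le c' \sum_{n=1}^{\infty} n^{-2k(1 - \rho/2)} < \infty. $$

The last inequality uses $\rho < 1$, which leads to $2k(1 - \rho/2) > 1$. Hence, condition (2.4) of [HT97] is satisfied as well. Therefore $n^{-1} \sum_{i=1}^{n} X_{n,i}$ converges to 0 almost surely.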

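C Proof of Lemma 2

Define $f(\beta) = \frac{1}{\beta} \log \big( \sum_{i=1}^{n} e^{\beta \log u_i} \big)$. Lemma 2 is equivalent to showing that $f(1 + \epsilon) \le f(1)$. We prove that $f$ is a decreasing function for all $\beta > 0$. Note that

$$ f'(\beta) = -\frac{1}{\beta^2} \log \Big( \sum_{i=1}^{n} e^{\beta \log u_i} \Big) + \frac{1}{\beta} \, \frac{\sum_{i=1}^{n} e^{\beta \log u_i} \log u_i}{\sum_{i=1}^{n} e^{\beta \log u_i}} = -\frac{1}{\beta^2} H(p), $$

where $H(p)$ is the entropy of the probability distribution on $\{1, \dots, n\}$ with $p_j = e^{\beta \log u_j} / \sum_{i=1}^{n} e^{\beta \log u_i}$, and is always non-negative. This finishes the proof.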
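The monotonicity of $f$ and the entropy formula for $f'$ can be verified numerically on a small example. The sketch below is illustrative: the vector `u` is an arbitrary choice of positive numbers and the function names are hypothetical.

```python
import math


def log_power_norm(u, beta):
    """f(beta) = (1/beta) * log(sum_i exp(beta * log u_i)) for positive u_i."""
    return math.log(sum(math.exp(beta * math.log(x)) for x in u)) / beta


def entropy_form_derivative(u, beta):
    """The closed form -H(p) / beta^2 with p_j proportional to u_j^beta."""
    weights = [x ** beta for x in u]
    total = sum(weights)
    entropy = -sum((w / total) * math.log(w / total) for w in weights)
    return -entropy / beta ** 2


if __name__ == "__main__":
    u = [0.3, 1.2, 2.5, 0.8]
    step = 1e-5
    numeric = (log_power_norm(u, 1.0 + step) - log_power_norm(u, 1.0 - step)) / (2 * step)
    print(numeric, entropy_form_derivative(u, 1.0))
    print(log_power_norm(u, 1.5) <= log_power_norm(u, 1.0))
```

The central finite difference should match the entropy expression, and both are non-positive, consistent with $f(1 + \epsilon) \le f(1)$.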

D Proof of probability and linear algebra lemmas

In this Appendix we provide proofs of two probability lemmas stated in Section 3.7.

D.1 Proof of Lemma 5

Note that by definition of empirical measure, $N^{-1} \sum_{i=1}^{N} \psi(v_i) = E_{\hat{p}_{v(N)}}\{ \psi(V) \}$. The proof uses a truncation technique. For a positive integer $B$ define $\psi_B$ by

$$ \psi_B(x) = \begin{cases} \psi(x) & |\psi(x)| \le B, \\ B & \psi(x) > B, \\ -B & \psi(x) < -B, \end{cases} $$

and write $\psi(x) = \psi_B(x) + \tilde{\psi}_B(x)$. Since $\hat{p}_{v(N)}$ converges weakly to $p_V$, for the bounded continuous function $\psi_B(x)$, we have

$$ \lim_{N\to\infty} E_{\hat{p}_{v(N)}}\{ \psi_B(V) \} = E_{p_V}\{ \psi_B(V) \}. \qquad (D.1) $$

On the other hand, since $\psi$ is pseudo-Lipschitz with order $k$ we have $|\psi(x)| \le L(1 + |x|^k)$ for $|x| \ge 1$. Therefore for large enough $B$,

$$ |\tilde{\psi}_B(x)| \le L(1 + |x|^k) \, I_{\{|\psi(x)| > B\}} \le L(1 + |x|^k) \, I_{\{|x|^k > \frac{B}{L} - 1\}}. $$

From this we obtain

$$ \begin{aligned} E_{p_V}\{ \psi_B(V) \} - \limsup_{N\to\infty} E_{\hat{p}_{v(N)}}\big\{ L(1 + |V|^k) I_{\{|V|^k > \frac{B}{L} - 1\}} \big\} &\le \liminf_{N\to\infty} E_{\hat{p}_{v(N)}}\{ \psi(V) \} \le \limsup_{N\to\infty} E_{\hat{p}_{v(N)}}\{ \psi(V) \} \\ &\le E_{p_V}\{ \psi_B(V) \} + \limsup_{N\to\infty} E_{\hat{p}_{v(N)}}\big\{ L(1 + |V|^k) I_{\{|V|^k > \frac{B}{L} - 1\}} \big\}. \end{aligned} $$

Now, by the assumption $\lim_{N\to\infty} E_{\hat{p}_{v(N)}}\{ |V|^k \} = E_{p_V}\{ |V|^k \}$, we can write $|V|^k = |V|^k I_{\{|V|^k > B/L - 1\}} + |V|^k I_{\{|V|^k \le B/L - 1\}}$ and use the weak convergence of $\hat{p}_{v(N)}$ to $p_V$ to get

$$ \lim_{N\to\infty} E_{\hat{p}_{v(N)}}\big\{ L(1 + |V|^k) I_{\{|V|^k \le \frac{B}{L} - 1\}} \big\} = E_{p_V}\big\{ L(1 + |V|^k) I_{\{|V|^k \le \frac{B}{L} - 1\}} \big\}. $$

Therefore

$$ \limsup_{N\to\infty} E_{\hat{p}_{v(N)}}\big\{ L(1 + |V|^k) I_{\{|V|^k > \frac{B}{L} - 1\}} \big\} = \lim_{N\to\infty} E_{\hat{p}_{v(N)}}\big\{ L(1 + |V|^k) I_{\{|V|^k > \frac{B}{L} - 1\}} \big\} = E_{p_V}\big\{ L(1 + |V|^k) I_{\{|V|^k > \frac{B}{L} - 1\}} \big\}. $$

Hence, all we need to show is that $E_{p_V}\{ L |V|^k I_{\{|V|^k > B/L - 1\}} \}$ converges to 0 as $B \to \infty$. But this follows using the bounded $k$-th moment of $V$ and the dominated convergence theorem, when applied to the sequence of functions $L(1 + |V|^k) I_{\{|V|^k > B/L - 1\}} \le L(1 + |V|^k)$, indexed by $B$.

D.2 Proof of Lemma 6

Recall that by Skorokhod's theorem, there exists a probability space $(\Omega, \mathcal{F}, P)$ and a construction of the random variables $\{(X_n, Y_n)\}_{n \ge 1}$ and $(X, Y)$ on this space, such that letting

$$ A = \{ \omega \in \Omega : (X_n(\omega), Y_n(\omega)) \to (X(\omega), Y(\omega)) \}, $$

be the event that $(X_n, Y_n)$ converges to $(X, Y)$, we have $P(A) = 1$. Let $C_F \subseteq \mathbb{R}^2$ be the domain on which $F$ is continuously differentiable. Since $F$ is Lipschitz continuous, $C_F$ has full Lebesgue measure. Since the probability distribution of $(X, Y)$ is absolutely continuous with respect to Lebesgue, $C_F$ has measure 1 under this measure. Hence if we let

$$ B = \{ \omega \in \Omega : (X(\omega), Y(\omega)) \in C_F \}, $$

we have $P(B) = 1$. On $A \cap B$, we also have $F'(X_n(\omega), Y_n(\omega)) \to F'(X(\omega), Y(\omega))$.

Letting $Z_n(\omega) = F'(X_n(\omega), Y_n(\omega))$ (if $(X_n(\omega), Y_n(\omega)) \notin C_F$ set $Z_n(\omega) = 0$) and $Z(\omega) = F'(X(\omega), Y(\omega))$, we thus proved that

$$ P\Big\{ \lim_{n\to\infty} Z_n(\omega) = Z(\omega) \Big\} = 1. $$

Since $F$ is Lipschitz, $|Z_n(\omega)| \le C$, and hence the bounded convergence theorem implies $E\{Z_n(\omega)\} \to E\{Z(\omega)\}$, which proves our claim.

D.3 Proof of Lemma 8

Let us denote by $Q$ the covariance of the gaussian vector $Z_1, \dots, Z_t$. The set of matrices $Q$ satisfying the constraints with constants $c_1, \dots, c_t, K$ is compact. Hence if the thesis does not hold, there must exist a specific covariance matrix satisfying these constraints, and such that

$$ E\{ [\ell(Z_t, Y)]^2 \} - u^* C^{-1} u = 0. \qquad (D.2) $$

Fix $Q$ to be such a matrix, and let $S \in \mathbb{R}^{t \times t}$ be the matrix with entries $S_{ij} = E\{ \ell(Z_i, Y) \ell(Z_j, Y) \}$. Then Eq. (D.2) implies that $S$ is not invertible (by the Schur complement formula). Therefore there exist non-vanishing constants $a_1, \dots, a_t$ such that

$$ a_1 \ell(Z_1, Y) + a_2 \ell(Z_2, Y) + \cdots + a_t \ell(Z_t, Y) \overset{a.s.}{=} 0. \qquad (D.3) $$

The function $(z_1, \dots, z_t) \mapsto a_1 \ell(z_1, Y) + \cdots + a_t \ell(z_t, Y)$ is Lipschitz and non-constant. Hence there is a set $A \subseteq \mathbb{R}^t$ of positive Lebesgue measure such that it is non-vanishing on $A$. Therefore, $A$ must have zero measure under the law of $(Z_1, \dots, Z_t)$, i.e. $\lambda_{\min}(Q) = 0$. This implies that there exist non-vanishing constants $a'_1, \dots, a'_t$ such that

$$ a'_1 Z_1 + a'_2 Z_2 + \cdots + a'_t Z_t \overset{a.s.}{=} 0. $$

If $t_* = \max\{ i \in \{1, \dots, t\} : a'_i \neq 0 \}$, this implies

$$ Z_{t_*} \overset{a.s.}{=} \sum_{i=1}^{t_* - 1} \Big( -\frac{a'_i}{a'_{t_*}} \Big) Z_i, $$

which contradicts the assumption $\mathrm{Var}(Z_{t_*} | Z_1, \dots, Z_{t_* - 1}) > 0$.



D.4 Proof of Lemma 9

We will prove the thesis by induction over $t$. The case $t = 1$ is trivial, and assume that the claim is true for any $(t-1)$ vectors $v_1, \dots, v_{t-1}$, with constant $c'_{t-1}$. Without loss of generality, we will assume $\|v_i\|^2 / n \le K$ for some constant $K$ independent of $n$ (increasing the norm of the $v_i$'s increases $\lambda_{\min}(C)$).

Let $V \in \mathbb{R}^{n \times t}$ be the matrix with columns $v_1, \dots, v_t$. Then $C = V^* V / n$. By Gram-Schmidt orthonormalization, we can construct $A$ upper triangular, and $U \in \mathbb{R}^{n \times t}$ orthonormal (i.e., with $U^* U = I_{t \times t}$) such that

$$ U = V A. $$

It follows that

$$ \lambda_{\min}(C) = \frac{1}{n} \lambda_{\min}(V^* V) = \frac{1}{n} \lambda_{\min}\big( (A^{-1})^* A^{-1} \big) = \frac{1}{n} \lambda_{\max}(A A^*)^{-1} = \frac{1}{n} \sigma_{\max}(A)^{-2}. \qquad (D.4) $$

Defining $u_1, \dots, u_t$ to be the columns of $U$, Gram-Schmidt orthonormalization prescribes

$$ u_i = \frac{v_i - P_{i-1}(v_i)}{\|v_i - P_{i-1}(v_i)\|}, $$

which implies $A_{ii} = \|v_i - P_{i-1}(v_i)\|^{-1} \le (c n)^{-1/2}$ and, for $j < i$,

$$ A_{ji} = -\frac{1}{\|v_i - P_{i-1}(v_i)\|} \big[ (V_{i-1}^* V_{i-1})^{-1} V_{i-1}^* v_i \big]_j. $$

We then have

$$ |A_{ji}| \le (c n)^{-1/2} \, \lambda_{\min}(V_{i-1}^* V_{i-1})^{-1} (i - 1) K n \le t (c n)^{-1/2} (c'_{t-1} n)^{-1} K n \le c'' n^{-1/2}. $$

It follows that $\sigma_{\max}(A) \le c''' n^{-1/2}$ (with $c'''$ independent of $n$), whence the thesis follows by Eq. (D.4).